> I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format.
That's because we know how to interpret the concepts used in these representations, in relation to each other. It's just a syntactic change.
You might have a point if it's used as a kind of search engine: "show me wikipedia articles where X causes Y?" (although there is at least one source besides wikipedia, but you get my drift).
> Aside from my point above - haven't looked at the source data, but I doubt it stops at that level.
It does. It isn't even a triple, it's a pair: (cause, effect). There's no other relation than "causes". And if I skimmed the article correctly, they just take noun phrases and slap an underscore between the words and call it a concept. There's no meaning attached to the labels.
But the higher-order causations you mention will be pretty useless if there's no way to interpret them. It'll only work for highly specialized, unambiguous concepts, like myxomatosis (which is akin to encoding knowledge in the labels themselves), and the broad nature of many of the concepts means usefulness decays quickly as the path grows longer. Here are some random examples (lengths 4 and 8, no posterior selection) from their "precision" set (197k pairs); a sketch of this kind of walk follows the examples:
['mistake', 'deaths', 'riots', 'violence']
['higher_operating_income', 'increase_in_operating_income', 'increase_in_net_income', 'increase']
['mail_delivery', 'delays', 'decline_in_revenue', 'decrease']
['wastewater', 'environmental_problems', 'problems', 'treatment']
['sensor', 'alarm', 'alarm', 'alarm']
['thatch', 'problems', 'cost_overruns', 'project_delays']
['smoking_pot', 'lung_cancer', 'shortness_of_breath', 'conditions']
['older_medications', 'side_effects', 'physical_damage', 'loss']
['less_fat', 'weight_loss', 'death', 'uncertainties']
['diesel_particles', 'cancer', 'damages', 'injuries']
['malfunction_in_the_heating_unit', 'fire', 'fire_damage', 'claims']
['drug-resistant_malaria', 'deaths', 'violence', 'extreme_poverty']
['fairness_in_circumstances', 'stress', 'backache', 'aching_muscles']
['curved_spine', 'back_pain', 'difficulties', 'stress', 'difficulties', 'delay', 'problem', 'serious_complications']
['obama', 'high_gas_prices', 'recession', 'hardship', 'happiness', 'success', 'promotions', 'bonuses']
['financial_devastation', 'bankruptcy', 'stigma', 'homelessness', 'health_problems', 'deaths', 'pain', 'quality_of_life']
['methylmercury', 'neurological_damage', 'seizures', 'changes', 'crisis', 'growth', 'problems', 'birth_defects']
That last one is probably correct, but the chain of reasoning is false...
This one is cherry-picked, but I found it too funny to omit:
['agnosticism', 'despair', 'feelings', 'aggression', 'action', 'riot', 'arrest', 'embarrassment', 'problems', 'black_holes']
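For reference, a minimal sketch of this kind of random walk, assuming the pairs are available as a flat CSV with cause/effect columns (the file name and column names are placeholders, not the dataset's actual layout):

```python
# Minimal sketch: sample causal chains by walking a flat list of (cause, effect)
# pairs. "precision_pairs.csv" and its column names are assumptions.
import csv
import random
from collections import defaultdict

def load_edges(path):
    out = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out[row["cause"]].append(row["effect"])
    return out

def random_chain(edges, length):
    chain = [random.choice(list(edges))]
    for _ in range(length - 1):
        nexts = edges.get(chain[-1])
        if not nexts:          # dead end: no outgoing "causes" edge
            break
        chain.append(random.choice(nexts))
    return chain

edges = load_edges("precision_pairs.csv")
for _ in range(5):
    print(random_chain(edges, 4))
```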
> There's no other relation than "causes".
Looking at their Neo4j graph, they also retain the provenance of the causal relation in "claimedIn" relations attached to the reified triple of each cause-effect pair. So that's at least marginally useful for fact-checking or quality evals.
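As a rough sketch of what querying that provenance could look like: only the "claimedIn" relation name comes from the graph itself; the node labels, the cause/effect relation names, property keys, and connection details below are placeholders, not the dataset's actual schema.

```python
# Minimal sketch (assumed schema): fetch the sources a cause-effect pair was
# claimed in, via the reified pair node and its "claimedIn" relations.
from neo4j import GraphDatabase

QUERY = """
MATCH (c:Concept {name: $cause})<-[:hasCause]-(pair)-[:hasEffect]->(e:Concept {name: $effect})
MATCH (pair)-[:claimedIn]->(src)
RETURN src
LIMIT 25
"""

def provenance(cause: str, effect: str):
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        return [record["src"] for record in session.run(QUERY, cause=cause, effect=effect)]

if __name__ == "__main__":
    for src in provenance("smoking_pot", "lung_cancer"):
        print(src)
```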
> they just take noun phrases and slap an underscore between the words and call it a concept.
Not to defend lazy approaches, but you could make this point about tokens also ("take any bunch of characters that happens often enough, and call it a token").
> It'll only work for highly specialized, unambiguous concepts
Fair point. Practically, there's not much use in this unless you really dedicate time to figure out what's meant by each concept, and prune junk. And by that time, you may as well make your own pairs.
But from a research POV, hardly anyone will go through such effort, so I still find it quite useful. Some potential questions that I could derive from this (aside from ~110 works citing it since 2020, which is not bad for KRR work):
- What is the quality of causal relations (e.g., diagnostic decision trees) in medical articles?
- How can the original scientific provenance of cause-effect pairs best be represented?
- Which extracted causes are true variables in some effect, and to what extent/direction? (i.e., an ablation study at scale, provided you can find appropriate data)
True.
I think we look at it from different sides. Mine is "how good will this be in independently running code" (where 80% correctness is the bare minimum), yours seems to be more "how well does this represent our knowledge" (from different angles).
> you could make this point about tokens also ("take any bunch of characters that happens often enough, and call it a token").
My reason was more that such an approach doesn't work well with (unrestricted natural-language) text. E.g. side_effects => physical_damage: Which side effects? (Why plural?) Not all of them cause physical damage. And not all side effects that cause physical damage cause the same damage. The differences are described elsewhere in the text, but not consistently enough to extend the token with that information, so just associating literal excerpts from a text will practically guarantee underspecification (except for practically unambiguous terms). The effectiveness will be language-dependent, of course.
Anyway...