RDF-star patterns for provenance
The RDF-star specification was published as a final community group report in December 2021. For a little more than a year, some participants of the RDF-DEV community group have joined forces to provide a consolidated description of RDF-star and an associated test suite. The goal is to foster the convergence of existing implementations, and the emergence of interoperable new ones.
The development of RDF-star has been met with a lot of enthusiasm and expectations, which is both a blessing and a curse. Many people, with different backgrounds and needs, seem to expect RDF-star to be the perfect and straightforward solution to their specific problem. To support multiple use cases, the group has striven to make RDF-star a generic enough toolbox, with which most use cases can be solved, even if this sometimes requires the user to go the extra mile.
In this post, we present some lessons learned by the group through discussions and exchanges. This is meant to give some insight into the rationale behind RDF-star, and some guidelines on how best to use it for modeling provenance data.
There’s only so much a quoted triple can carry
The first example of the RDF-star specification is repeated below:
PREFIX : <http://www.example.org/>
:employee38 :familyName "Smith" .
<< :employee38 :jobTitle "Assistant Designer" >> :accordingTo :employee22 .
The intended meaning of this small RDF-star graph is: “employee #38 is named Smith, and employee #22 claims that employee #38 is an assistant designer”. This example illustrates, in particular, how a quoted triple (between double angle brackets) can be used (here, as the subject of another triple) without being asserted: we (the authors of the graph) do not endorse the claim made by employee #22. By quoting, we are referring to the triple without making the triple itself part of the graph.
This example could be extended as follows:
PREFIX : <http://www.example.org/>
<< :employee38 :jobTitle "Assistant Designer" >>
    :accordingTo :employee22, :employee38 ;
    :confidence 0.8 .
In this new example, both employee #22 and employee #38 are making an identical claim, still not endorsed by us. Furthermore, we assign a confidence score to the statement that employee #38’s job title is “Assistant Designer”.
To illustrate how this kind of modeling could be useful, imagine an RDF store containing a collection of claims, described as above with claimers and confidence level. The following SPARQL-star query could be used to retrieve, for each claimer, the minimum confidence we have in the statements they claimed about themselves.
PREFIX : <http://www.example.org/>
SELECT ?claimer (MIN(?conf) AS ?minConfidence)
WHERE {
    << ?claimer ?p ?o >> :accordingTo ?claimer ; :confidence ?conf .
}
GROUP BY ?claimer
It is important, however, to understand that this basic design has limitations. Namely, each statement made about a particular triple must be interpretable independently of the other statements made about that triple. (This is actually a general feature of RDF, not just RDF-star: two statements about the same subject must always be interpretable independently of each other. On the open web, if we had to assume that another triple that we have not yet discovered could change the meaning of the triples that we know, then reasoning with what we know would become much more hazardous.)
Therefore, while it could be tempting to extend the examples above as follows, it would be a bad design, as we will show.
# ⚠ YOU MUST NOT DO THIS
PREFIX : <http://www.example.org/>
<< :employee38 :jobTitle "Assistant Designer" >>
    :accordingTo :employee22 ; :confidence 0.2 .
# we don’t trust employee22 about someone else’s job title
<< :employee38 :jobTitle "Assistant Designer" >>
    :accordingTo :employee38 ; :confidence 0.8 .
# we quite trust employee38 about their own job title
First, note that the example above changes the meaning of the :confidence predicate. It is not used anymore to represent the general confidence we have in the triple itself, but to represent the confidence that we have in a particular person claiming the triple. If we were to use an actual ontology, those two different notions of “confidence” would require two distinct IRIs.
But most importantly, the problem with the example above is that it does not accurately capture the intended meaning, because it is equivalent to:
PREFIX : <http://www.example.org/>
<< :employee38 :jobTitle "Assistant Designer" >>
    :accordingTo :employee22 ;
    :accordingTo :employee38 ;
    :confidence 0.2 ;
    :confidence 0.8 .
The four triples asserted by this graph have the same subject (namely, the quoted triple << :employee38 :jobTitle "Assistant Designer" >>), and there is no way to know which claimer is associated with which confidence score.
In this respect, RDF-star differs from (some implementations of) Property Graphs, which allow multiple identical edges to coexist between two nodes and to carry different properties. Note that this “impedance mismatch” was recognized as early as 2014, but that some solutions were already envisioned then.
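The ambiguity in the collapsed graph can be observed directly with a query. The sketch below assumes that the four triples of the "must not do this" example are loaded in a SPARQL-star store, and asks which confidence score goes with which claimer:

```sparql
PREFIX : <http://www.example.org/>

# Ask for every (claimer, confidence) pair attached to the quoted triple.
SELECT ?claimer ?conf
WHERE {
    << :employee38 :jobTitle "Assistant Designer" >>
        :accordingTo ?claimer ;
        :confidence  ?conf .
}
```

Because both triple patterns share the same subject and are matched independently, the query returns all four combinations, in no particular order: :employee22 with 0.2, :employee22 with 0.8, :employee38 with 0.2, and :employee38 with 0.8. Nothing in the graph distinguishes the two intended pairings from the two unintended ones.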
More complex provenance modeling
The problem with the last example above is that we are not talking about the triple << :employee38 :jobTitle "Assistant Designer" >> (which is uniquely identified by its subject, predicate and object). We want to talk about two similar but distinct claims, each claim with its own identity, and its own properties. Let us introduce a new property linking a given triple to one or several of its claims. A correct version of the previous example would now be:
PREFIX : <http://www.example.org/>
<< :employee38 :jobTitle "Assistant Designer" >> :hasClaim <#c1>, <#c2>.
<#c1> :claimer :employee22; :claimConfidence 0.2 .
<#c2> :claimer :employee38; :claimConfidence 0.8 .
As an autonomous entity, each claim can have any number of properties that will no longer be confused with the properties of other claims of the same triple. We could for instance extend this example by adding to each claim a date, a source document…
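As a rough sketch of such an extension, each claim could carry a date and a source document. The property names :claimDate and :claimSource, the dates, and the document IRIs below are assumptions made up for illustration; they do not come from the RDF-star report or from any published vocabulary.

```turtle
PREFIX : <http://www.example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<< :employee38 :jobTitle "Assistant Designer" >> :hasClaim <#c1>, <#c2> .

# Each claim is its own resource, so its properties stay grouped together.
<#c1> :claimer :employee22 ;
      :claimConfidence 0.2 ;
      :claimDate "2021-11-02"^^xsd:date ;    # hypothetical property and value
      :claimSource <#meeting-notes> .        # hypothetical property and IRI

<#c2> :claimer :employee38 ;
      :claimConfidence 0.8 ;
      :claimDate "2021-11-15"^^xsd:date ;    # hypothetical property and value
      :claimSource <#self-assessment> .      # hypothetical property and IRI
```

Since <#c1> and <#c2> are distinct subjects, the date and source of one claim cannot be mistaken for those of the other.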
With such a design, the SPARQL-star query above needs to be updated, and would become:
PREFIX : <http://www.example.org/>
SELECT ?claimer (MIN(?conf) AS ?minConfidence)
WHERE {
    << ?claimer ?p ?o >> :hasClaim [
        :claimer ?claimer ; :claimConfidence ?conf
    ]
}
GROUP BY ?claimer
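With this design, the question of which claimer goes with which confidence score has a well-defined answer. As a sketch, assuming the three-statement :hasClaim graph shown earlier in this section is loaded in a SPARQL-star store, the following query lists each claim made about the quoted triple:

```sparql
PREFIX : <http://www.example.org/>

# Each claim is a separate resource, so ?claimer and ?conf are bound together
# through the same ?claim.
SELECT ?claim ?claimer ?conf
WHERE {
    << :employee38 :jobTitle "Assistant Designer" >> :hasClaim ?claim .
    ?claim :claimer ?claimer ;
           :claimConfidence ?conf .
}
```

On that graph the query yields exactly two solutions: <#c1> with :employee22 and 0.2, and <#c2> with :employee38 and 0.8.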
Epilogue
It could be argued that we have always been talking about claims, even in the first two examples of this post, and therefore that these two examples were badly designed and should have used the :hasClaim property as well. We argue that the design of the first two examples is sufficient when the properties recorded about claims are simple enough. A balance always has to be found between simplicity and usability on the one hand, and purity and scalability on the other. Following George Box’s aphorism that “all models are wrong, but some are useful”, we consider that the design of the first two examples is useful enough in some situations.
Acknowledgement
Thanks to the members of the RDF-star group for their reviews and feedback on this post.