Dereferenceable identifiers [RDID] #53

Closed
jpullmann opened this issue Jan 18, 2018 · 43 comments

Comments

@jpullmann

Dereferenceable identifiers [RDID]

Encode identifiers as dereferenceable HTTP URIs


Related use cases: Modeling identifiers and making them actionable [ID11] 
@makxdekkers
Contributor

I agree but I don't think there is a need to make any changes in the DCAT specification. It's an automatic consequence of DCAT being an RDF vocabulary.

@dr-shorthair
Contributor

Yes - that was my impression too. No changes necessary to satisfy this requirement.

@andrea-perego
Contributor

andrea-perego commented Jan 19, 2018

This issue is strictly related to providing guidance on how to use DCAT to specify identifiers as DOIs, ISBNs, etc. - see the related use case.
Currently, these IDs are encoded as simple strings, unless they are used as part of the primary resource URI. An option could be to encourage the use of owl:sameAs whenever the ID is resolvable when encoded as a URI (as with DOIs).
So, there may be no need to create new property / classes, but rather to describe how to use the existing ones to address these use cases.

@makxdekkers
Contributor

makxdekkers commented Jan 19, 2018

It could be part of DCAT guidance, maybe in the usage note of https://www.w3.org/TR/vocab-dcat/#Property:dataset_identifier? In fact, the current usage note only suggests that the "identifier might be used as part of the URI of the dataset" but it would be good to mention other identifiers in the usage note as well.

@andrea-perego
Contributor

andrea-perego commented Jan 19, 2018

> It could be part of DCAT guidance, maybe in the usage note of https://www.w3.org/TR/vocab-dcat/#Property:dataset_identifier? In fact, the current usage note only suggests that the "identifier might be used as part of the URI of the dataset" but it would be good to mention other identifiers in the usage note as well.

+1 from me. One of the options mentioned in the related use case is to use dct:identifier with a datatype denoting the identifier type (DOI, etc.). But these datatypes need to be defined. There's of course also the other option of using specific properties for each type of identifier (prism:doi, bibo:doi, etc.).
But for specifying multiple identifiers as HTTP URIs we need a property such as owl:sameAs, which needs to be added to the DCAT spec.

@agbeltran
Member

Should we add the 'documentation' tag for this requirement then?

@kcoyle
Contributor

kcoyle commented Jan 24, 2018

The library world has struggled with this same problem. There are many identifiers that are not (yet) expressed as IRIs. As these are just alpha-numeric strings, there is a need to give a context so that they are meaningful/useful. This has led to some awkward models of identifiers being at least 2-part: the identifier string, and the "provenance" of the identifier. So although one should prefer IRI forms when available, what should be done with a string like "098378297" when it is the identifier from some agency? That's the hard part.

@agbeltran
Member

@kcoyle - could you provide a pointer to a catalogue from the library world with that situation? In those examples, is it not possible to get a description of the resource being identified at all?

For the case of life science data, which I imagine would also be applicable to other scientific domains, our paper "Identifiers for the 21st century: how to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data" (https://doi.org/10.1371/journal.pbio.2001414) presents the situation and could be a useful reference.

@kcoyle
Contributor

kcoyle commented Feb 1, 2018

Here are some (there are about a dozen common ones) (type followed by example):

  • US Copyright Office document ID: PA 1-060-815
  • ISBN: 9780060723804 (also ISSN for serials is similar)
  • Patent document number (requires country code): 67-SC41534
  • Standard technical report number: METPRO/CB/TR--74/216+PR.ENVR.WI

I'm not sure what you mean about getting a description of the resource - the bibliographic record is a description of the resource; what is problematic is that not all identifiers have a URI form. These are not "web-based identifiers" and I suspect that other data providers also have older identifiers that are not (yet) web-based. These are in library data. In addition, the identifiers for the records in a data file in library data are simple strings, like "##2001627090" (the octothorpes represent blanks). These are exported as URIs when the data is converted to RDF ("https://lccn.loc.gov/2015020100") but there is still a majority of data that is not in RDF.

The concept of a "data catalog" is not applied to these files of records although there are sites that provide files of the records for download. This may or may not fit into the context of DCAT.

@makxdekkers
Contributor

makxdekkers commented Feb 7, 2018

The European DCAT-AP includes a property adms:identifier with range adms:Identifier for Dataset. adms:Identifier is based on the UN/CEFACT Identifier class and consists of:

  • a content string which is the identifier;
  • an optional identifier for the identifier scheme;
  • an optional identifier for the version of the identifier scheme;
  • an optional identifier for the agency that manages the identifier scheme.
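As a rough illustration of that four-part structure, the sketch below models it as a plain Python record. The class and field names are made up for this example and are not ADMS terms; only the content string is mandatory, mirroring the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    """Illustrative stand-in for the four parts listed above (not an official API)."""
    content: str                          # the identifier string itself
    scheme: Optional[str] = None          # identifier of the identifier scheme
    scheme_version: Optional[str] = None  # identifier of the scheme version
    scheme_agency: Optional[str] = None   # agency that manages the scheme


# "ISBN" is just a label chosen for the example, not a registered scheme IRI.
isbn = Identifier(content="9780060723804", scheme="ISBN")
# A bare string with no context attached, as in the "098378297" case.
bare = Identifier(content="098378297")
```

Keeping the scheme optional reflects the fact that some agency identifiers arrive with no stated provenance at all.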

@nicholascar
Contributor

nicholascar commented Feb 7, 2018

Whatever mechanism we decide on for handling identifiers, it should be comprehensive enough to be used in a DCAT2 and also elsewhere since we have the requirement of referring to alternate identifiers (non-HTTP URIs) for things like physical samples in SOSA ontology catalogues.

@dr-shorthair
Contributor

Discussed at some length in meeting https://www.w3.org/2018/02/07-dxwgdcat-minutes

AndreaPerego: main issue is that a number of other identifier systems are used for data citation, publishers, etc
... e.g. DataCite supports quite a few identifier systems
... DCAT-AP also discussed this at length
... agencies want to use their internal identifiers, not necessarily URIs
... may connect datasets using SPARQL queries, etc not just URIs
... what is needed by different communities is ability to specify different kinds of identifiers, and their type
... need to indicate that a string is an identifier
... whenever possible make identifiers resolvable by encoding them as URIs, but this does not apply to all identifier systems
... situation is quite complicated
... there are some other URI systems, but not necessarily resolvable
... case sensitivity is also an issue
... proposal made in UC is to try to address both issues
... 1. encode as http URIs where possible
... 2. encode as a string using dct:identifier property, and note the type of the identifier using ^^type indicator
... UC is about providing guidance where standard RDF http URI does not apply
... also for how to use SPARQL queries, for example
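The two options in the minutes above can be sketched as a small helper that returns a Turtle object term. This is only an illustration under stated assumptions: the `http://example.org/id-type/` namespace is hypothetical (the thread notes that identifier-type datatypes still need to be defined), and only DOIs are treated as having a known HTTP form, via the doi.org resolver.

```python
from urllib.parse import quote

# Hypothetical namespace for identifier-type datatypes (an assumption for
# this sketch; no such datatypes are defined in the thread).
ID_TYPE_NS = "http://example.org/id-type/"


def encode_identifier(value: str, id_type: str) -> str:
    """Return a Turtle object term for an identifier of the given type."""
    if id_type == "doi":
        # Option 1: encode as an HTTP URI where possible.
        return "<https://doi.org/" + quote(value, safe="/") + ">"
    # Option 2: keep the string and note its type as the literal's datatype.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"^^<' + ID_TYPE_NS + id_type + ">"


doi_term = encode_identifier("10.1109/5.771073", "doi")
isbn_term = encode_identifier("9780060723804", "isbn")
```

The first term would suit a property such as owl:sameAs, the second dct:identifier, following the options discussed earlier in the thread.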

NicholasCar: We had the same issue with physical samples.
... recommendation there is to supplement identifier field with identifier-type
... need a comprehensive schema for alternative identifiers

SimonCox: Makx suggested looking at ADMS.
... https://www.w3.org/TR/vocab-adms/#identifier

It would be beneficial to have a comprehensive handling of identifiers, inc. type and other properties, as we need to use them in many situations, not just DCAT

SimonCox: There's an adms:Identifier class there.

Yes, ADMS seems to mostly do it!

SimonCox: It is based on UN/CEFACT. So this appears to fulfil the proposal you made, AndreaPerego.
... Adopt or clone adms:Identifier

AndreaPerego: alternative proposals: PRISM, BIBO,
... specific fields for well-known identifier schemes, e.g. bibo:DOI
... these are already used by some important services, e.g. crossRef
... need to explain how these different approaches map to each other

@dr-shorthair
Contributor

Proposal: promote adms:Identifier to DCAT

@fellahst

I suggest we clone ADMS identifier or use an ontology addressing only identifiers such as http://data.press.net/ontology/identifier/ or http://ows.usersmarts.com/owldocgen/owldoc?url=http://www.opengis.net/ont/common/identifier# . The reason to use a micro ontology for just identifiers is that it can be reused for many other purposes. It would be nice to submit this small ontology for standardization by W3C.

@nicholascar
Contributor

I like! I think the first micro-ontology needs a few more things though: notes on identifier formats, whether they are structured or opaque strings, etc. These could be optional.

@philarcher

On this topic, one of my priorities at GS1 now is to make barcodes dereferenceable. In more formal terms, we're defining how GTINs (the numbers you see beneath a barcode) and our other less well-known identifiers, can be encoded in HTTP URIs. I mention it here because there is a close relationship between our GTINs and ISBN and ISSN (ISBNs all begin 978 or 979 but are part of the EAN/UPC/GTIN world). Therefore, if this WG has use cases for dereferenceable ISBNs, I'd be pleased to know, especially if you have any idea where they should dereference to!

@makxdekkers
Contributor

In fact, you could call the definition of adms:Identifier a micro-ontology: it defines a class and a set of properties to describe it, plus a note that "it may also be useful to provide further properties".

@larsgsvensson
Contributor

> Therefore, if this WG has use cases for dereferenceable ISBNs, I'd be pleased to know, especially if you have any idea where they should dereference to!

I don't have a use case, but my first reaction would be that ISBNs should dereference to the national library for the jurisdiction where the publication was published (or where the publisher is located). But maybe I'm biased since I work in a national library...

@agbeltran
Member

Also relevant to this discussion is the schema.org discussion on identifiers and the sdo:identifier term.

@smrgeoinfo
Contributor

after a review of the discussion, it looks like there are two proposals:
ADMS kind of approach-- identifiers are typed literals, as with skos:notation, and the datatype of the typed literal is the identifier type, e.g.
dcat:identifier "978-3-16-148410-0"^^<https://www.iso.org/standard/36563.html>
It's not clear to me how ADMS would serialize the other properties (version and managing authority)

schema.org, ISO19115, DATS approach-- make identifier an object/class with a code property (the identifier string), a scheme property, maybe an authority property.

Personally I think the second approach is more transparent and widely used.
Schema.org implements the identifier as a PropertyValue, which obfuscates things;
DATS uses 'identifier' and 'identifierSource' as the property names;
ISO19115-1 uses 'code', 'codespace', and 'version', with a citation for the 'authority'
DataCite has 'identifier' and 'identifierType'

proposal:
class: dcat:identifier
Properties:
dcat:code -- the identifier string; for a well formed URI this would be all that's necessary
dcat:identifierType -- literal or URI
dcat:version -- literal
authority -- foaf:Organization

@makxdekkers
Contributor

@smrgeoinfo ADMS also makes the identifier a class, namely adms:Identifier.
The spec at https://www.w3.org/TR/vocab-adms/#identifier indeed does not provide a full recommendation on how to express the other properties of the Identifier, but I would suggest:

  • the identifier string in skos:notation
  • the identifier scheme in skos:inScheme
  • the version in owl:versionInfo
  • the agency in dct:creator or dct:publisher
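To make that suggested mapping concrete, here is a minimal sketch that renders one adms:Identifier as Turtle text. The function name and the example IRIs are invented for illustration, and the helper assumes the notation and version strings contain no characters that need escaping in a Turtle literal.

```python
from typing import Optional

PREFIXES = """\
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
"""


def identifier_to_turtle(dataset_iri: str, notation: str,
                         scheme_iri: Optional[str] = None,
                         version: Optional[str] = None,
                         agency_iri: Optional[str] = None) -> str:
    """Render the suggested property mapping as a blank-node description."""
    pairs = ["a adms:Identifier", f'skos:notation "{notation}"']
    if scheme_iri:
        pairs.append(f"skos:inScheme <{scheme_iri}>")
    if version:
        pairs.append(f'owl:versionInfo "{version}"')
    if agency_iri:
        pairs.append(f"dct:creator <{agency_iri}>")
    body = " ;\n".join("    " + pair for pair in pairs)
    return f"{PREFIXES}\n<{dataset_iri}> adms:identifier [\n{body}\n] .\n"


# example.org IRIs are placeholders, not real scheme or dataset identifiers.
turtle = identifier_to_turtle(
    "http://example.org/dataset/1",
    "10.1109/5.771073",
    scheme_iri="http://example.org/scheme/doi",
)
print(turtle)
```

Optional parts are simply omitted when absent, which matches the "optional" wording in the ADMS description.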

I would not be in favour of defining a dcat:Identifier class alongside the adms:Identifier class that basically does the same thing.

@riccardoAlbertoni
Contributor

riccardoAlbertoni commented Nov 15, 2018

adms:identifier is already adopted in some DCAT application profiles, so I second the idea of using it rather than introducing new terms, at least as a first attempt.

As part of action 259, which was assigned to me in last week's DCAT call, I have drafted the following wiki page, DCAT-Identifiers.

In such a page, I have tried to set up a proposal based on existing adms:identifier examples.

The page is still in progress, I certainly need to update it with the latest @makxdekkers suggestions. Though it is not yet complete, and corrections might be needed, I guess it can help the discussion.

@smrgeoinfo
Contributor

@riccardoAlbertoni thanks, that wiki page is helpful. A couple comments:
in the Representing HTTP dereferenceable secondary identifier section, there seems to be an assumption that the ^^xsd:anyURI type implies that the literal is an HTTP URI, but the data type allows any valid RFC-3986 URI (e.g. urn:), and these might not be dereferenceable.

Also, in the example, with a doi:

 skos:notation  "10.1109/5.771073"^^dcat:doi  ;
 adms:schemeAgency "International DOI Foundation" .

I would suggest that the issuing authority of interest should be the registrant for the 10.1109 doi space, "IEEE Xplore Digital Library", perhaps this should be added as a dct:creator. There are two concerns-- the authority that defined the identifier scheme (DOI foundation), and the authority responsible for assigning and maintaining identifiers using that scheme (IEEE).

@makxdekkers I got the impression from the adms doco that the identifier scheme is encoded as the data type in the skos:notation typed literal, so using skos:inScheme would be redundant, and I think it's also not consistent with the intention of skos:inScheme.

@riccardoAlbertoni
Contributor

> in the Representing HTTP dereferenceable secondary identifier section, there seems to be an assumption that the ^^xsd:anyURI type implies that the literal is an HTTP URI, but the data type allows any valid RFC-3986 URI (e.g. urn:), and these might not be dereferenceable.

I see your point @smrgeoinfo, the title is slightly misleading.
I suspect that the only way to know if a URI is HTTP dereferenceable is to try to resolve it, as it can be broken.

As far as I can understand, indicating a URN is useful as well. Independently of their dereferenceability, secondary IDs are indicated to say that others might refer to the same dataset with different IDs; they are useful to manage/group duplicates. So I have made the distinction between dereferenceable and non-dereferenceable URIs less sharp.
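The point about xsd:anyURI can be illustrated with a purely syntactic check. This sketch (the function name is made up for the example) only tells whether a value is written as an http or https URI; as noted above, whether it actually resolves can only be found out by trying.

```python
from urllib.parse import urlsplit


def is_http_uri(value: str) -> bool:
    """True if the value is syntactically an http(s) URI with a host part."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


http_case = is_http_uri("https://lccn.loc.gov/2015020100")
urn_case = is_http_uri("urn:isbn:9780060723804")
info_case = is_http_uri("info:doi/10.1109/5.771073")
```

All three values are valid URIs for an xsd:anyURI literal, but only the first is even a candidate for HTTP dereferencing.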

@riccardoAlbertoni
Contributor

@smrgeoinfo wrote

> I would suggest that the issuing authority of interest should be the registrant for the 10.1109 doi space, "IEEE Xplore Digital Library", perhaps this should be added as a dct:creator. There are two concerns-- the authority that defined the identifier scheme (DOI foundation), and the authority responsible for assigning and maintaining identifiers using that scheme (IEEE).

@smrgeoinfo Please take a look at example 7. Have I correctly interpreted your suggestion?

@agbeltran
Member

To answer 'Question 1' in 'Proposal 1' from @riccardoAlbertoni's notes on the wiki, the DataCite schemas include an XSD with a list of identifier types/schemes here:

https://schema.datacite.org/meta/kernel-4.1/include/datacite-relatedIdentifierType-v4.xsd

@agbeltran
Member

Also FAIRsharing keeps a registry of identifier schemes: https://fairsharing.org/standards/?q=&selected_facets=type_exact:identifier%20schema

@agbeltran
Member

As regards @smrgeoinfo's point on identifying both the identifier scheme and the organisation minting the identifiers, it seems to me that this use case is not covered by ADMS: adms:schemeAgency gives the name of the "agency that manages the identifier scheme" as a literal, while dct:creator would point to a representation of that same organisation rather than a separate one. Is that correct, @makxdekkers?

Apart from that interpretation of ADMS, example 7 would cover accounting for both the identifier scheme/type and the organisation maintaining it IMO.

@makxdekkers
Contributor

@agbeltran Yes, dct:creator and adms:schemeAgency should be for the same organisation. The literal option was provided because schema agencies might not be in Linked Data space and have no URI.

@riccardoAlbertoni
Contributor

> Yes, dct:creator and adms:schemeAgency should be for the same organisation. The literal option was provided because schema agencies might not be in Linked Data space and have no URI.

Then, assuming we want to distinguish between (a) the authority that defined the identifier scheme (DOI foundation), and (b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE), we need to consider a property distinct from dct:creator for (b)

I see two alternative options here

  1. add a new dcat property (e.g., named dcat:idMantainer/dcat:IdAuthority) to indicate (b), the authority responsible for assigning and maintaining identifiers using that scheme (IEEE).
  2. use dct:publisher to indicate (b) instead of defining a new property such as dcat:idMantainer/dcat:IdAuthority. However, I've got the impression that dct:creator / dct:publisher are used interchangeably to refer to the schema agency (i.e., the DOI Foundation in @smrgeoinfo's example), so I do not know if this is really possible.

Which of the two does the group think is more reasonable?
Does anyone see further options?

@makxdekkers
Contributor

makxdekkers commented Nov 24, 2018

@riccardoAlbertoni I am not in favour of your proposal.
As I understand it, the DOI Foundation is the schema agency for DOI. Period. The fact that DOI is organised in such a way that there are registration agencies and registrants for sub-spaces under DOI should be irrelevant. Moreover, naming the registrant goes against the philosophy of DOI where the sub-spaces are abstracted from the organisation that registers them, with the advantage that DOIs don't change when the organisation changes or the responsibility for that sub-space is handed over to someone else. Your proposal risks creating a dependency that DOI itself tries to avoid.
So, in summary, I vote against both options, and suggest using adms:Identifier as specified, allowing only a single agency.

@riccardoAlbertoni
Contributor

Thanks @makxdekkers for your comment.
If I have correctly interpreted your message you are not in favour of the requirement behind my modelling attempt, namely the need to mention both
a) the authority that defined the identifier scheme (DOI foundation), and
b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE),
as it was suggested by @smrgeoinfo. @smrgeoinfo Have I misinterpreted your suggestion?

I've found @makxdekkers' motivations convincing, I also guess that similar considerations might hold for other identifier schemes.
So I have included your motivations for not representing (b) in example 7.

@makxdekkers
Contributor

Correct, I am not in favour of the requirement to model more than one authority for identifiers.

@smrgeoinfo
Contributor

short story:

I think what a user really needs to know is what the identifier scheme is (not who defined it): in particular, whether those identifiers can be dereferenced, how they can be dereferenced, and what kind of representations of the identified resource should be available. The agent defining the scheme is not the info needed for this use case. Back to the original question: if identifiers are required to be http: URIs, the base identifier scheme is known (http), but the practical matter is that various agents embed identifiers within the http URI, and the identifier scheme that matters to the user is not http but the embedded scheme, e.g. doi, ark, igsn...

details

a) the authority that defined the identifier scheme (DOI foundation), and
b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE),

@riccardoAlbertoni yes you are interpreting my suggestion as intended, and I think @makxdekkers point about the registering agent is valid.

If a registered URI type is used (following RFC-3986), the identifier scheme is part of the URI; a separate identifier scheme property is redundant in that case. If the skos:notation in the adms:Identifier has type ^^xsd:anyURI, then the identifier for the scheme should be the prefix on the ID string ('http:' in example 7).

DOI is registered as a namespace in the 'info' URI scheme (see faq #11 ), so it would appear that to formally encode a DOI as an rfc 3986 URI it would look like 'info:doi/10.1109/5.771073'. The info namespace registry was off line when I tried to check this.

As for dct:creator, it seems odd to me that the dct:creator property on an adms:Identifier is not the creator of the identifier instance but rather the creator of the identifier scheme. This would be confusing if one were not conversant in the usage recommendations for adms; if that's the convention we should stick with it.

To me, the major use case for knowing the identifier scheme is that it should tell you how you can dereference the identifier, and ideally what kind of representations for the identified resource are available, so there is no particular need to identify the agent responsible for actually issuing and maintaining the lifecycle of the identifier. In the case of a DOI, knowing the scheme lets a user know that the registering agent is specified by the prefix part of the id string and that there are ways to dereference that.
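As a small illustration of the last point, the sketch below splits a DOI into its registrant prefix and suffix, and builds the 'info:doi/' form mentioned above. The function names are invented for this example, and the validation is deliberately minimal: it only checks for the "10." directory indicator and a non-empty suffix.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def info_doi_uri(doi: str) -> str:
    """Build the 'info:doi/...' form described in the comment above."""
    split_doi(doi)  # reject strings that are clearly not DOIs
    return "info:doi/" + doi


prefix, suffix = split_doi("10.1109/5.771073")
info_uri = info_doi_uri("10.1109/5.771073")
```

Here the prefix "10.1109" is the part that points at the registrant, so a consumer who knows the scheme does not need a separate property naming that agent.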

@agbeltran
Member

Marking this issue as 'due for closing' given PR #614

@agbeltran added the "due for closing" label (issue that is going to be closed if there are no objections within 6 days) on Dec 16, 2018
@agbeltran
Member

Closing after merging #614
