Best practice for a loosely-structured catalog #253

Closed
jakubklimek opened this issue Jun 11, 2018 · 21 comments

Comments

@jakubklimek
Contributor

@dr-shorthair raised this in the mailing list:

I’ve been doing some investigations of some local repositories and catalogues, and have uncovered that in many cases ‘datasets’ are ‘just a bag of files’. There is no distinction made between part/whole, distribution (representation), and other kinds of relationship (e.g. documentation, schema, supporting documents). So while the precision we are aiming for in DCAT is clearly valuable in terms of semantics, it is difficult to implement on these legacy systems. Mostly I see people using the Dataset-distribution-> relationship for everything … which is clearly incorrect in many cases. But I doubt if we are unusual in this.

I’m thinking about how to advise on this, while not actually breaking DCAT.
If we made dcat:distribution a sub-property of dct:relation
dcat:distribution rdfs:subPropertyOf dct:relation .

then I think we can have a reasonable recommendation to the simple repositories.
We could tell repositories that use the ‘just a bag of files’ approach to say

 :Dataset987 a dcat:Dataset ;
     dct:relation <file1> , <file2> , <file3> , <file4> , <file5> , <file6> , <file7> … .

which would not be inconsistent with a later reclassification to

  :Dataset987 a dcat:Dataset ;
              dct:hasPart <file1> , <file2> ;
              dcat:distribution <file3> , <file4> ;
              dct:conformsTo <file5> ;
              dct:requires <file6> ;
              dct:references <file7> .  

If this is not all mad, I will add a new use-case - something like ‘Mapping from simple repository model’ – as justification, and propose this tiny enhancement.
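A minimal sketch of why the later reclassification would not contradict the loose description, written in plain Python with tuples rather than an RDF library. It assumes the proposed axiom (dcat:distribution as a sub-property of dct:relation) and relies on the DCMI Terms declarations that dct:hasPart, dct:conformsTo, dct:requires and dct:references are sub-properties of dct:relation. The `entail_relations` helper is hypothetical.

```python
# Properties treated as sub-properties of dct:relation.
# dcat:distribution is the axiom PROPOSED in this thread (an assumption here);
# the four dct: entries are declared as such in DCMI Metadata Terms.
SUBPROPERTIES_OF_RELATION = {
    "dcat:distribution",
    "dct:hasPart",
    "dct:conformsTo",
    "dct:requires",
    "dct:references",
}


def entail_relations(triples):
    """Apply the RDFS sub-property rule once: if p is a sub-property of
    dct:relation, then (s, p, o) also gives (s, dct:relation, o)."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SUBPROPERTIES_OF_RELATION:
            inferred.add((s, "dct:relation", o))
    return inferred


# The 'bag of files' description: everything linked with dct:relation.
loose = {(":Dataset987", "dct:relation", f"<file{i}>") for i in range(1, 8)}

# The later, more precise description of the same files.
refined = {
    (":Dataset987", "dct:hasPart", "<file1>"),
    (":Dataset987", "dct:hasPart", "<file2>"),
    (":Dataset987", "dcat:distribution", "<file3>"),
    (":Dataset987", "dcat:distribution", "<file4>"),
    (":Dataset987", "dct:conformsTo", "<file5>"),
    (":Dataset987", "dct:requires", "<file6>"),
    (":Dataset987", "dct:references", "<file7>"),
}

# Every loose statement is still entailed by the refined description.
print(loose <= entail_relations(refined))
```

Under these assumptions a consumer that only understands dct:relation keeps seeing the same seven links after the publisher refines the record.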

I had a few concerns regarding this proposal:

  1. It is not clear to me from the description what exactly the file* IRIs are. If they are actual downloadable files, i.e. something originally linked using dcat:downloadURL, I would object to linking them directly from a dcat:Dataset record, as this would create a mess wherever a publisher is too lazy to describe the data properly.
  2. Would it be possible to get a few more detailed examples of how this would work?
  3. In my experience, data publishers misuse dcat:distribution mainly because of the lack of support for dataset series, which is being resolved in this DCAT revision. Once this support is added, publishers will be able to model many use cases correctly.
@dr-shorthair
Contributor

dr-shorthair commented Jun 12, 2018

I should have written

<resource1> , <resource2> , <resource3> , <resource4> , <resource5> , <resource6> , <resource7>

though in practice they usually are files in repositories.

In strict DCAT terms

  • <file1> , <file2> are probably better modelled as other individual dcat:Datasets, so their descriptions should have URIs in the context of a catalog
  • <file3> , <file4> are probably dcat:Distributions, so the descriptions would often be blank nodes, with a downloadURL or accessURL to the actual file
  • <file5> is probably another dcat:Dataset though preferably to an online resource (standard schema!)
  • <file6> should probably be another dcat:Dataset
  • <file7> is a document stored as part of the package. Again, strictly another dcat:Dataset somewhere.

But this is all idealized. The point is that most repositories do not require the depositor to make such distinctions, and as long as manually-completed forms are involved there will be resistance or non-compliance from the kind of data depositors that I have in mind (researchers). There might be some heuristics that could be applied, and future automation will help. But my proposal is that with the addition of just one axiom we might accommodate the present reality in a way that improves on current habits - where in the absence of something better, everything in a bag of files is often linked to the dataset using dcat:distribution - which I think we all agree is wrong.

@agbeltran
Member

As this discussion moved from the mailing list to this issue, for completeness I'm adding the other messages from the mailing list in this thread.

@makxdekkers said:

Simon,

This is indeed an issue that came up in the development of DCAT-AP. In
particular, CKAN is quite liberal in what it accepts as "Resource" related
to a Dataset. The discussion was whether you could map CKAN Resource to DCAT
Distribution, and it was clear that such mapping would have unwanted
effects. This is also related to my earlier question on how "similar"
distributions need to be, which led to a statement that they need to be
"informationally equivalent" (#52).

I support your proposed solution to use dct:relation as a catch-all and to
allow for further specialisation whenever necessary and possible.

Makx.

and @andrea-perego said:

Makx, Simon,

In the extension of DCAT-AP we use in the JRC Data Catalogue, besides distributions we typically have (a) related publications and (b) "other resources" (a catch-all category including everything that is not a distribution or a publication). As I said elsewhere [1,2], related publications are specified via dct:isReferencedBy, whereas "other resources" are linked with dct:relation (used as a generic relationship to link a dataset with any kind of related resources). So, this use case may support the idea of making dcat:distribution a subproperty of dct:relation.

BTW, this pattern is reflected in our CKAN extension – see, e.g.:

http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core

About the fact that the majority of data catalogues use a simple metadata pattern, this is also my experience. Hierarchical "is part of" relationships are far from being common. There may be a number of reasons. For instance, if metadata are manually created (as is still usually the case) this would require a high maintenance effort. Also in the geospatial domain, where there's explicitly this notion ("dataset series"), what is documented is just the "root" dataset, and the children are not even linked to. Another issue may be related to limitations of catalogue platforms – which typically do not support this feature – or to the usability issues resulting from burdening users with choosing among a long list of datasets which are almost identical but for some variables (e.g., spatial and/or temporal coverage).

It is also worth noting that the approach used for specifying hierarchical relationships depends very much on the domain and on specific characteristics of a dataset. We have to deal quite often with this issue in the JRC Data Catalogue, and the approaches used are very different – e.g.: 1 dataset with a distribution for each of its children; 1 dataset for each child dataset, and no record for the parent.

So, probably, we should take into account this situation when providing recommendations on how to model hierarchical/subsetting relationships, and propose alternative options, depending on the specific use case.

Cheers,

Andrea

[1] https://www.w3.org/TR/dcat-ucr/#ID9
[2] #63 (comment)

@agbeltran agbeltran added the dcat label Jun 12, 2018
@jakubklimek
Contributor Author

@dr-shorthair I see. After giving it some thought, I also quite like the idea of a dcat:distribution being just one of the possible dct:relations.

Still, my main concern is that accommodating this kind of loose description adds complexity for consumers of such data (both people and applications such as data catalogs): some DCAT records will be described only by a dcat:Dataset with a bunch of dct:relation links, others will have proper dcat:distribution links, and consumers will have to account for all these possibilities and maybe more. The benefit is that some publishers who currently use dcat:distribution incorrectly may use dct:relation instead.

In the end, it all comes down to whether we should accommodate existing behavior where datasets are clearly not described well enough (for various reasons), or encourage describing them properly. Maybe this could be done by at least strongly recommending to stick to the Dataset -> Distribution -> File or Dataset -> Data Distribution Service pattern.
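To make the consumer-side cost concrete, here is a minimal Python sketch of the branching a harvester would need if both patterns were allowed. The record contents, the `collect_links` and `downloadable_candidates` helpers, and the identifiers `:A`, `:B` and `:A-csv` are all hypothetical.

```python
LINK_PROPERTIES = ("dcat:distribution", "dct:relation")


def collect_links(triples, dataset):
    """Group a dataset's linked resources by the property used to link them."""
    links = {prop: [] for prop in LINK_PROPERTIES}
    for subject, prop, obj in triples:
        if subject == dataset and prop in links:
            links[prop].append(obj)
    return links


def downloadable_candidates(triples, dataset):
    """Return (resource, is_typed) pairs. Resources linked only with
    dct:relation are untyped: the consumer cannot tell whether they are
    distributions, parts, documentation or something else."""
    links = collect_links(triples, dataset)
    typed = [(res, True) for res in links["dcat:distribution"]]
    untyped = [(res, False) for res in links["dct:relation"]]
    return typed + untyped


# A record following the Dataset -> Distribution pattern.
strict_record = [(":A", "dcat:distribution", ":A-csv")]

# A 'bag of files' record using only dct:relation.
loose_record = [
    (":B", "dct:relation", "<file1>"),
    (":B", "dct:relation", "<file2>"),
]

print(downloadable_candidates(strict_record, ":A"))
print(downloadable_candidates(loose_record, ":B"))
```

The untyped branch is where the extra work lands: each such resource needs a heuristic or a human decision before it can be presented as data.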

@dr-shorthair
Contributor

This discussion relates to proposed use-case ID53 - #256

@dr-shorthair
Contributor

dr-shorthair commented Jul 7, 2018

Following up on the ACTION assigned in this week's DCAT meeting.

Example 1 - undifferentiated set of files each of which is linked to the dcat:Dataset using dcterms:relation whose object is a blank node:

dap:d33937
  rdf:type dcat:Dataset ;
  dcterms:accessRights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "The metadata and files (if any) are available to the public." ;
    ] ;
  dcterms:bibliographicCitation "Cox, Simon (2018): RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale). v1. CSIRO. Data Collection. https://doi.org/10.25919/5b42a082052fa" ;
  dcterms:description "A set of RDF graphs representing the International [Chrono]stratigraphic Chart, ..." ;
  dcterms:identifier "https://doi.org/10.25919/5b4d2b83cbf2d"^^xsd:anyURI ;
  dcterms:issued "2018-07-17"^^xsd:date ;
  dcterms:language [
      skos:notation "en" ;
    ] ;
  dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
  dcterms:relation <http://resource.geosciml.org/classifier/ics/ischart/> ;
  dcterms:relation <http://resource.geosciml.org/ontology/timescale/gts> ;
  dcterms:relation <http://stratigraphy.org/> ;
  dcterms:relation <https://vocabs.ands.org.au/viewById/196> ;
  dcterms:relation [      dcterms:identifier "ChronostratChart2017-02.pdf" ;    ] ;
  dcterms:relation [      dcterms:identifier "ChronostratChart2017-02.jpg" ;    ] ;
  dcterms:relation [      dcterms:identifier "isc2017.jsonld" ;    ] ;
  dcterms:relation [      dcterms:identifier "isc2017.nt" ;    ] ;
  dcterms:relation [      dcterms:identifier "isc2017.rdf" ;    ] ;
  dcterms:relation [      dcterms:identifier "isc2017.ttl" ;    ] ;
  dcterms:relation [      dcterms:identifier "timescale.zip" ;    ] ;
  dcterms:rights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "All Rights (including copyright) CSIRO 2018." ;
    ] ;
  dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
  dcat:contactPoint <https://people.csiro.au/C/S/Simon-Cox> ;
  dcat:keyword "GeoSPARQL" , "OWL" , "OWL-Time" , "RDF" , "SOSA" , "SSN" , "geologic timescale" , "reference system" , "stratigraphy" , "vocabulary" ;
  dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/information-engineering-and-theory> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/interorganisational-information-systems-and-web-services> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/stratigraphy> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/web-technologies> ;
.

Example 2 - The same dataset with the 'files' linked using more precise semantics: four of the files are representations of the data, two (the JPEG and the PDF) are copies of the source chart, and one is a zip archive containing the schema/ontology definitions:

dap:d33937
  rdf:type dcat:Dataset ;
  dcterms:accessRights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "The metadata and files (if any) are available to the public." ;
    ] ;
  dcterms:bibliographicCitation "Cox, Simon (2018): RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale). v1. CSIRO. Data Collection. https://doi.org/10.25919/5b42a082052fa" ;
  dcterms:description "A set of RDF graphs representing the International [Chrono]stratigraphic Chart, ..." ;
  dcterms:identifier "https://doi.org/10.25919/5b4d2b83cbf2d"^^xsd:anyURI ;
  dcterms:issued "2018-07-17"^^xsd:date ;
  dcterms:language [
      skos:notation "en" ;
    ] ;
  dcterms:relation <http://resource.geosciml.org/classifier/ics/ischart/> ;
  dcterms:relation <http://resource.geosciml.org/ontology/timescale/gts> ;
  dcterms:relation <http://stratigraphy.org/> ;
  dcterms:relation <https://vocabs.ands.org.au/viewById/196> ;
  dcterms:isFormatOf [
      rdf:type dcat:Dataset ;
      dcterms:source <http://stratigraphy.org/index.php/ics-chart-timescale> ;
      dcterms:title "Graphical representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
      dcterms:type dctype:Image ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "ChronostratChart2017-02.jpg" ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:byteSize "1629104"^^xsd:decimal ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/image/jpeg> ;
        ] ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "ChronostratChart2017-02.pdf" ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:byteSize "296233"^^xsd:decimal ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/application/pdf> ;
        ] ;
    ] ;
  dcat:distribution [
      rdf:type dcat:Distribution ;
      dcterms:identifier "isc2017.jsonld" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcat:byteSize "698039"^^xsd:decimal ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/ld+json> ;
    ] ;
  dcat:distribution [
      rdf:type dcat:Distribution ;
      dcterms:identifier "isc2017.nt" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcat:byteSize "2047874"^^xsd:decimal ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/n-triples> ;
    ] ;
  dcat:distribution [
      rdf:type dcat:Distribution ;
      dcterms:identifier "isc2017.rdf" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcat:byteSize "1600569"^^xsd:decimal ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml> ;
    ] ;
  dcat:distribution [
      rdf:type dcat:Distribution ;
      dcterms:identifier "isc2017.ttl" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcat:byteSize "531703"^^xsd:decimal ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/text/turtle> ;
    ] ;
  dcterms:references [
      rdf:type dcat:Dataset ;
      dcterms:title "Geological timescale ontology" ;
      dcterms:type owl:Ontology ;
      dcat:distribution [
          dcterms:identifier "timescale.zip" ;
          dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
        ] ;
    ] ;
  dcterms:rights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "All Rights (including copyright) CSIRO 2018." ;
    ] ;
  dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
  dcat:contactPoint <https://people.csiro.au/C/S/Simon-Cox> ;
  dcat:keyword "GeoSPARQL" , "OWL" , "OWL-Time" , "RDF" , "SOSA" , "SSN" , "geologic timescale" , "reference system" , "stratigraphy" , "vocabulary" ;
  dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/information-engineering-and-theory> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/interorganisational-information-systems-and-web-services> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/stratigraphy> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/web-technologies> ;
.

@dr-shorthair
Contributor

dr-shorthair commented Jul 8, 2018

... and this example is where the set of files are actually representations of parts of the dataset:

  1. First, just using dct:relation
dap:atnf-P366-2003SEPT
  rdf:type dcat:Dataset ;
  dcterms:accessRights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "The metadata and files (if any) are available to the public." ;
    ] ;
  dcterms:bibliographicCitation "Burgay, M; McLaughlin, M; Kramer, M; Lyne, A; Joshi, B; Pearce, G; D'Amico, N; Possenti, A; Manchester, R; Camilo, F (2017): Parkes observations for project P366 semester 2003SEPT. v1. CSIRO. Data Collection. https://doi.org/10.4225/08/598dc08d07bb7" ;
  dcterms:description "Parkes multibeam high-latitude pulsar survey" ;
  dcterms:identifier "https://doi.org/10.4225/08/598dc08d07bb7"^^xsd:anyURI ;
  dcterms:identifier "ivo://au.csiro.atnf/P366-2003SEPT"^^xsd:anyURI ;
  dcterms:identifier [
      rdf:type adms:Identifier ;
      dcterms:creator <https://www.doi.org/> ;
      skos:notation "10.4225/08/598dc08d07bb7" ;
      adms:schemeAgency "International DOI Foundation" ;
    ] ;
  dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
  dcterms:modified "2017-07-30T08:55:55Z"^^xsd:dateTime ;
  dcterms:relation [      dcterms:identifier "PH0090_0011.sf" ;    ] ;
  dcterms:relation [      dcterms:identifier "PH0090_0021.sf" ;    ] ;
  dcterms:relation [      dcterms:identifier "PH0090_0031.sf" ;    ] ;
  dcterms:rights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "All Rights (including copyright) CSIRO 2017." ;
    ] ;
  dcterms:temporal [
      rdf:type dcterms:PeriodOfTime ;
      rdf:type time:ProperInterval ;
      time:hasBeginning [
          rdf:type time:Instant ;
          time:inXSDDate "2003-09-01"^^xsd:date ;
        ] ;
      time:hasEnd [
          rdf:type time:Instant ;
          time:inXSDDate "2003-12-31"^^xsd:date ;
        ] ;
    ] ;
  dcterms:title "Parkes observations for project P366 semester 2003SEPT" ;
  dcat:contactPoint [
      rdf:type v:Individual ;
      v:fn "Marta Burgay" ;
      v:hasEmail <mailto:burgay@oa-cagliari.inaf.it> ;
    ] ;
  dcat:keyword "pulsar" ;
  dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/astronomical-and-space-sciences-not-elsewhere-classified> ;
.

  2. And using more precise semantics. Since the files are each a representation of part of the dataset, they are described as distributions of (anonymous) datasets which are linked using the dct:hasPart relationship:

dap:atnf-P366-2003SEPT_1
  rdf:type dcat:Dataset ;
  dcterms:accessRights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "The metadata and files (if any) are available to the public." ;
    ] ;
  dcterms:bibliographicCitation "Burgay, M; McLaughlin, M; Kramer, M; Lyne, A; Joshi, B; Pearce, G; D'Amico, N; Possenti, A; Manchester, R; Camilo, F (2017): Parkes observations for project P366 semester 2003SEPT. v1. CSIRO. Data Collection. https://doi.org/10.4225/08/598dc08d07bb7" ;
  dcterms:description "Parkes multibeam high-latitude pulsar survey" ;
  dcterms:hasPart [
      rdf:type dcat:Dataset ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "PH0090_0011.sf" ;
          dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
          dcat:byteSize "1000000000"^^xsd:decimal ;
        ] ;
    ] ;
  dcterms:hasPart [
      rdf:type dcat:Dataset ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "PH0090_0021.sf" ;
          dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
          dcat:byteSize "402000000"^^xsd:decimal ;
        ] ;
    ] ;
  dcterms:hasPart [
      rdf:type dcat:Dataset ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "PH0090_0031.sf" ;
          dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
          dcat:byteSize "82000000"^^xsd:decimal ;
        ] ;
    ] ;
  dcterms:identifier "https://doi.org/10.4225/08/598dc08d07bb7"^^xsd:anyURI ;
  dcterms:identifier "ivo://au.csiro.atnf/P366-2003SEPT"^^xsd:anyURI ;
  dcterms:identifier [
      rdf:type adms:Identifier ;
      dcterms:creator <https://www.doi.org/> ;
      skos:notation "10.4225/08/598dc08d07bb7" ;
      adms:schemeAgency "International DOI Foundation" ;
    ] ;
  dcterms:modified "2017-07-30T08:55:55Z"^^xsd:dateTime ;
  dcterms:rights [
      rdf:type dcterms:RightsStatement ;
      rdfs:comment "All Rights (including copyright) CSIRO 2017." ;
    ] ;
  dcterms:temporal [
      rdf:type dcterms:PeriodOfTime ;
      rdf:type time:ProperInterval ;
      time:hasBeginning [
          rdf:type time:Instant ;
          time:inXSDDate "2003-09-01"^^xsd:date ;
        ] ;
      time:hasEnd [
          rdf:type time:Instant ;
          time:inXSDDate "2003-12-31"^^xsd:date ;
        ] ;
    ] ;
  dcterms:title "Parkes observations for project P366 semester 2003SEPT" ;
  dcat:contactPoint [
      rdf:type v:Individual ;
      v:fn "Marta Burgay" ;
      v:hasEmail <mailto:burgay@oa-cagliari.inaf.it> ;
    ] ;
  dcat:keyword "pulsar" ;
  dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
  dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/astronomical-and-space-sciences-not-elsewhere-classified> ;
.

@makxdekkers
Contributor

@dr-shorthair, some questions:

  1. You have statements like dcterms:relation [ dcterms:identifier "isc2017.ttl" ; ] ;. I see this is a blank node, but I can't see what the referenced resource is or where it can be accessed. Should there not be a link to the file, rather than just the identifier?
  2. In the second example, I see you model this with a 'parent' dataset that has no distributions but links to parts which themselves are datasets. However, the part datasets are not 'real' datasets as they have no URI for themselves. They also have no metadata, just distributions. Is the idea that they 'inherit' metadata from the 'parent'?
  3. In general, you work with an approach that only assigns a URI to the top-level dataset and uses blank nodes for everything else (part datasets, distributions, rights statements, contact points). I think it is good practice to assign URIs for all individuals. Maybe not for arbitrary time periods but everything else should be individually addressable and could be stored as such in a database. Also, some things might be reused locally (e.g. the CSIRO rights statement). But maybe the examples are a simplification and a real implementation would assign those URIs?

@dr-shorthair
Contributor

dr-shorthair commented Jul 9, 2018

@makxdekkers, some responses:

  1. The repository that these examples come from does not assign external identifiers to the files/elements. Download access is mediated by a form. So this identification method was the best I could come up with.

  2. Correct. As explained in the (revised) commentary above, these files are representations of parts of the dataset. Representations are usually modeled as dcat:Distribution. My sense is that a dct:hasPart relationship should be between dcat:Datasets. So I tried to respect these various issues using blank nodes for the notional (undescribed) datasets which have distributions that are the actual files.

  3. These are real examples from CSIRO's Data Access Portal (DAP). The DCAT descriptions are, however, manually constructed by me. In the first description for each one I have not used any information that is not already in the metadata in the DAP. It is not perfectly aligned with DCAT, but it is a real repository. The goal of this issue is to propose that we develop guidelines for such imperfect 'legacy' repositories.

The landing page URLs do work, so you can inspect the sources for yourself.
https://data.csiro.au/dap/landingpage?pid=csiro:33937
https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT
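
The pattern described in response 2 (notional part datasets as blank nodes, each with one distribution that is the actual file) can be sketched in Python as a small triple generator. This is only an illustration: the `parts_as_datasets` helper and the `_:partN` / `_:distN` blank node labels are hypothetical, and the file identifiers are the ones from the example above.

```python
def parts_as_datasets(parent, file_ids):
    """For each file, emit a notional (blank node) dcat:Dataset that is a
    part of the parent and has one dcat:Distribution identified by the file."""
    triples = []
    for n, file_id in enumerate(file_ids, start=1):
        part = f"_:part{n}"  # undescribed part dataset
        dist = f"_:dist{n}"  # distribution standing for the actual file
        triples += [
            (parent, "dcterms:hasPart", part),
            (part, "rdf:type", "dcat:Dataset"),
            (part, "dcat:distribution", dist),
            (dist, "rdf:type", "dcat:Distribution"),
            (dist, "dcterms:identifier", file_id),
        ]
    return triples


triples = parts_as_datasets(
    "dap:atnf-P366-2003SEPT",
    ["PH0090_0011.sf", "PH0090_0021.sf", "PH0090_0031.sf"],
)

# dcterms:hasPart always points at a dataset, never directly at a distribution.
part_nodes = {o for s, p, o in triples if p == "dcterms:hasPart"}
dataset_nodes = {s for s, p, o in triples if p == "rdf:type" and o == "dcat:Dataset"}
print(part_nodes == dataset_nodes)
```

Replacing the blank node labels with minted IRIs would not change the structure.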

@dr-shorthair
Contributor

@makxdekkers re the

CKAN extension – see, e.g.:
http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core

Could we get this example in DCAT? I can't find the API specification to pull it down.

@makxdekkers
Contributor

@dr-shorthair We'll need to ask @andrea-perego. I have no access to the back-end of the JRC catalogue.

@agbeltran
Member

agbeltran commented Jul 12, 2018

Thanks @dr-shorthair. Here is an example where distributions were used for a case of multiple files, as there was no other way of representing this.

The example, as provided by the catalogue, is actually in schema.org, but pretty much there is a 1-to-1 mapping.

[] a schema:Dataset ;
    schema:creator [ a schema:Organization ;
            schema:name "Ofsted" ] ;
    schema:dateModified "2016-12-12T14:16:44.522Z"^^schema:Date ;
    schema:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care.  Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
    schema:distribution [ a schema:DataDownload ;
            schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
            schema:fileFormat <CSV> ;
            schema:name "Outstanding Providers list 1993-2014" ],
        [ a schema:DataDownload ;
            schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
            schema:fileFormat <CSV> ;
            schema:name "Outstanding Providers list 2014-2015" ],
        [ a schema:DataDownload ;
            schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
            schema:fileFormat <ODS> ;
            schema:name "Outstanding Providers list 2015-2016" ] ;
    schema:includedInDataCatalog [ a schema:DataCatalog ;
            schema:url <https://data.gov.uk/> ] ;
    schema:keywords "Education" ;
    schema:license [ a schema:CreativeWork ;
            schema:name "Open Government Licence" ;
            schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
    schema:name "Outstanding providers list" ;
    schema:url <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .

Here goes an attempt to use dcterms:relation instead:

[] a dcat:Dataset ;
    dct:publisher [ a foaf:Organization ;
            rdfs:label "Ofsted" ] ;
    dct:modified "2016-12-12T14:16:44.522Z"^^xsd:dateTime ;
    dct:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care.  Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
    dcterms:relation [            
            dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
            dcat:mediaType "text/csv" ;
            dct:title "Outstanding Providers list 1993-2014" ],
        [ 
            dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
            dcat:mediaType "text/csv" ;
            dct:title "Outstanding Providers list 2014-2015" ],
        [
            dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
            dct:format <ODS> ;
            dct:title "Outstanding Providers list 2015-2016" ] ;   
    dcat:keyword "Education" ;
    dcterms:license [  
            dct:title "Open Government Licence" ;
            schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
    dct:title "Outstanding providers list" ;
    dct:identifier <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .

So, my questions/comments would be:

  • using dcterms:relation in this way to point to multiple files that are not really distributions is a simple and useful way to cover the use case, which wasn't covered in DCAT before
  • I'm using dcat:downloadURL above, but this is wrong as it has domain dcat:Distribution - what property to use instead? dcat:accessURL is also for distributions.
  • supporting the use of dcterms:relation in this way, it is quite likely that developers would choose this simple representation even when the use of dcat:distribution would be appropriate; so, do we need to encourage the use of the richer semantics representation as per @dr-shorthair examples (through guidance documentation in the spec, a primer, examples, etc) and what would be the consequences of people using the simple representation instead?
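
The domain problem raised in the second bullet can be illustrated with a minimal Python sketch of the RDFS domain rule. It assumes, as stated above, that dcat:downloadURL and dcat:accessURL have domain dcat:Distribution; the `infer_types` helper, the `_:file1` blank node label and the `<file1.csv>` IRI are hypothetical.

```python
# Assumed rdfs:domain declarations, as described in the comment above.
DOMAINS = {
    "dcat:downloadURL": "dcat:Distribution",
    "dcat:accessURL": "dcat:Distribution",
}


def infer_types(triples):
    """Apply the RDFS domain rule: if p has domain C, then (s, p, o)
    gives (s, rdf:type, C)."""
    return {
        (s, "rdf:type", DOMAINS[p])
        for s, p, o in triples
        if p in DOMAINS
    }


# A file linked with the generic property but described with dcat:downloadURL.
record = [
    ("_:dataset", "dcterms:relation", "_:file1"),
    ("_:file1", "dcat:downloadURL", "<file1.csv>"),
]

# An RDFS reasoner would conclude that the file is a distribution after all.
print(infer_types(record))
```

So a record that avoids dcat:distribution but keeps dcat:downloadURL still tells an RDFS-aware consumer that the linked node is a dcat:Distribution, which is the inconsistency the bullet points at.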

@jakubklimek
Contributor Author

@agbeltran Thanks for the example. This is exactly what I would not like to be allowed or encouraged by DCAT, as what you describe can be perfectly well represented as 3 datasets (each with a different temporal coverage and one distribution), and after the DCAT revision, hopefully, using a 4th dataset having these 3 as parts (i.e. dataset series).

The issues that you describe, i.e. properties having dcat:Distribution as domain, I see as a natural consequence of insufficient metadata description, not something that should be supported, which would probably lead to further relaxation of the domains, and therefore greater mess in DCAT data.

As I stated earlier, I do not see the value of allowing representation of "just a bag of files" and I would rather encourage publishers to describe the files properly rather than creating messy DCAT data.

@dr-shorthair Regarding your usage of blank nodes, coming from the Linked Data community, I would discourage their usage. Simply everything should have an IRI, according to the basic Linked Data principles. No one can anticipate that there will be no interest to link to, e.g. parts of datasets (or datasets in a dataset series, which I think is the same thing). Furthermore, I would object to stating that dataset parts should inherit some properties from their parent dataset, as again this is messier to consume.

@dr-shorthair
Contributor

@jakubklimek I understand your concern about the blank nodes. In this issue I was tackling a separate question: the lack of guidance on how to represent the information in many existing catalogs, and the consequent mis-use of the dcat:distribution property. The examples above are merely concerned with getting the modeling right. The key point is to propose that dct:hasPart relationships should be to other datasets, not to distributions.

Best practice would certainly be to identify and describe them in their own right. However, as we have no more information available in the catalog that I was quoting from, I was just making sure that the model was correct first.

We have already heard that existing catalogs commonly use blank nodes for Distributions. So we should probably tackle recommendations around blank nodes generally in a separate issue. Perhaps you can create that?
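
The key point above, that dct:hasPart should point to other datasets rather than to distributions, could be checked mechanically. The following is only a rough sketch: it treats a graph as a plain set of (subject, predicate, object) string tuples, and the identifiers `:series`, `:part2014`, `:csv2014`, `:ods2016` and the example.org URL are hypothetical, not taken from any catalog.

```python
def misplaced_parts(triples):
    """Return dct:hasPart targets that are not explicitly typed dcat:Dataset."""
    datasets = {s for (s, p, o) in triples
                if p == "rdf:type" and o == "dcat:Dataset"}
    return {o for (s, p, o) in triples
            if p == "dct:hasPart" and o not in datasets}


# Hypothetical graph: one part is modelled as a dataset with a distribution,
# the other is a bare file attached directly via dct:hasPart.
triples = {
    (":series", "rdf:type", "dcat:Dataset"),
    (":series", "dct:hasPart", ":part2014"),
    (":part2014", "rdf:type", "dcat:Dataset"),
    (":part2014", "dcat:distribution", ":csv2014"),
    (":series", "dct:hasPart", ":ods2016"),
    (":ods2016", "dcat:downloadURL", "https://example.org/list-2015-16.ods"),
}

flagged = misplaced_parts(triples)
```

Here only `:ods2016` is flagged, because it is used as a part without being described as a dataset in its own right. The check relies on explicit rdf:type statements and does no inference.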

@dr-shorthair
Contributor

dr-shorthair commented Jul 13, 2018

@agbeltran Great. The next step could be to use dct:hasPart, a sub-property of dct:relation, to finish the job.

 a dcat:Dataset ;
    dct:publisher [ a foaf:Organization ;
            rdfs:label "Ofsted" ] ;
    dct:modified "2016-12-12T14:16:44.522Z"^^schema:Date ;
    dct:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care.  Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
    dct:hasPart [
            a dcat:Dataset ;
            dcat:distribution [            
                dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
                dcat:mediaType "text/csv" ;
                dct:title "Outstanding Providers list 1993-2014" ] ;
            ] ;
    dct:hasPart [
            a dcat:Dataset ;
            dcat:distribution [            
                dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
                dcat:mediaType "text/csv" ;
                dct:title "Outstanding Providers list 2014-2015" ] ;
            ] ;
    dct:hasPart [
            a dcat:Dataset ;
            dcat:distribution [            
                dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
                dct:format <ODS> ;
                dct:title "Outstanding Providers list 2015-2016" ] ;   
            ] ;
    dcat:keyword "Education" ;
    dct:license [  
            dct:title "Open Government Licence" ;
            schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
    dct:title "Outstanding providers list" ;
    dct:identifier <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .

@jakubklimek
Contributor Author

@dr-shorthair Thanks for the clarification, now I think we are on the same page.

Regarding blank nodes, I created #300.

+1 on the usage of dcterms:hasPart in @agbeltran's example to create a dataset series (important for #81).

@dr-shorthair
Contributor

The global domain constraints on dcat:accessURL and dcat:mediaType entail that the resources entitled "ChronostratChart2017-02.pdf" and "timescale.zip" in the example above are both of type dcat:Distribution, although their relationship to the Dataset is not dcat:distribution.

Is this OK?
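
To make the entailment concrete, here is a minimal sketch of the RDFS domain rule (rdfs2) applied to the timescale.zip resource. It models triples as plain string tuples rather than using an RDF library, and `:d33937` and `_:timescale` are placeholder identifiers chosen for illustration.

```python
RDF_TYPE = "rdf:type"

# rdfs:domain axioms for the two properties discussed above.
DOMAINS = {
    "dcat:accessURL": "dcat:Distribution",
    "dcat:mediaType": "dcat:Distribution",
}


def infer_types(triples, domains=DOMAINS):
    """Rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    return {(s, RDF_TYPE, domains[p]) for (s, p, o) in triples if p in domains}


# Placeholder identifiers standing in for the dataset and the blank node.
triples = {
    (":d33937", "dcterms:references", "_:timescale"),
    ("_:timescale", "dcterms:identifier", "timescale.zip"),
    ("_:timescale", "dcat:accessURL",
     "https://data.csiro.au/dap/landingpage?pid=csiro:33937"),
    ("_:timescale", "dcat:mediaType",
     "https://www.iana.org/assignments/media-types/application/zip"),
}

inferred = infer_types(triples)
```

The only inferred triple types `_:timescale` as dcat:Distribution, even though nothing links it to the dataset via dcat:distribution.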

@jakubklimek
Contributor Author

@dr-shorthair This is interesting, and I think it is not OK. This leads to the question of whether dcat:Distributions can exist independently of datasets, i.e. distributions which are not part of any dataset. I would say they cannot... they are by definition distributions of a dataset.

The next question is whether the files you reference are distributions of another dataset and, if so, which one. But then dcterms:references and dcterms:isFormatOf would connect a dataset to another dataset's distribution, which I think is not right.

Or, the example needs to be expanded, and these relations would connect to a dataset, which would have to have a distribution, like this:

<dataset> dcterms:references [ a dcat:Dataset;
      dcat:distribution [
        dcterms:identifier "timescale.zip" ;
        dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
        dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> 
        ]
    ] ;

Regarding the isFormatOf relation, since it is defined as "A related resource that is substantially the same as the described resource, but in another format", I would see this as a relation between two distributions (those have formats), not datasets, which are independent of formats.
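
If distributions are not meant to exist independently of a dataset, a catalog could be checked for "orphans" along these lines. This is only a sketch over string tuples with hypothetical identifiers (`:dataset`, `:rdf`, `:zip`); it looks at explicit rdf:type statements and does not apply any domain inference.

```python
def orphan_distributions(triples):
    """Return resources typed dcat:Distribution that no dataset
    points to via dcat:distribution."""
    typed = {s for (s, p, o) in triples
             if p == "rdf:type" and o == "dcat:Distribution"}
    linked = {o for (s, p, o) in triples if p == "dcat:distribution"}
    return typed - linked


# Hypothetical graph: :rdf is a proper distribution of :dataset, while
# :zip is typed as a distribution but only reached via dcterms:references.
triples = {
    (":dataset", "rdf:type", "dcat:Dataset"),
    (":dataset", "dcat:distribution", ":rdf"),
    (":rdf", "rdf:type", "dcat:Distribution"),
    (":dataset", "dcterms:references", ":zip"),
    (":zip", "rdf:type", "dcat:Distribution"),
}

orphans = orphan_distributions(triples)
```

In this graph `:zip` is reported as an orphan. Interposing a dcat:Dataset node, as in the expanded example above, would make the check pass.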

@makxdekkers
Contributor

I agree with @jakubklimek that it feels wrong. It seems to me that a Distribution is supposed to distribute something. The definition of Distribution says it "Represents a specific available form of a dataset", so there must be a connection to a Dataset, and that connection is modelled using dcat:distribution. How to relate the timescale.zip file to the Dataset then depends on the role of that file in relation to the Dataset. It is not obvious from the example that the file is a distribution of some other dataset. If it is, then @jakubklimek's suggestion might work; otherwise, maybe use more general properties that do not imply that the file is a distribution of anything:

dcterms:references [
      dcterms:identifier "timescale.zip" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      foaf:page <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcterms:format <https://www.iana.org/assignments/media-types/application/zip>
    ]

@dr-shorthair
Contributor

dr-shorthair commented Jul 17, 2018

OK - I've updated the example above to interpose a dcat:Dataset node in front of the distributions (currently a blank node - sorry @jakubklimek).

In the case of the dct:isFormatOf relation, the current resource is an RDF dataset, while the predecessor is an Image. The RDF is a re-formulation of the data on the image. Perhaps there is a better predicate than dct:isFormatOf?

For the time being, I've added an additional Distribution of the image to reinforce the message that this relationship is between datasets, each of which can have multiple representations:

dap:d33937
  rdf:type dcat:Dataset ;
  dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
  dcterms:isFormatOf [
      rdf:type dcat:Dataset ;
      dcterms:source <http://stratigraphy.org/index.php/ics-chart-timescale> ;
      dcterms:title "Graphical representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
      dcterms:type dctype:Image ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "ChronostratChart2017-02.jpg" ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:byteSize "1629104"^^xsd:decimal ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/image/jpeg> ;
        ] ;
      dcat:distribution [
          rdf:type dcat:Distribution ;
          dcterms:identifier "ChronostratChart2017-02.pdf" ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:byteSize "296233"^^xsd:decimal ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/application/pdf> ;
        ] ;
    ] ;
.

In the case of the dct:references relation the target has type owl:Ontology.

dap:d33937
  rdf:type dcat:Dataset ;
  dcterms:references [
      rdf:type dcat:Dataset ;
      dcterms:title "Geological timescale ontology" ;
      dcterms:type owl:Ontology ;
      dcat:distribution [
          dcterms:identifier "timescale.zip" ;
          dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
          dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
          dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
        ] ;
    ] ;
.

Since the latter case refers to an OWL ontology serialized in Turtle packaged in a zip archive, it will need updating when we resolve #259 .

@dr-shorthair
Contributor

Resolved in https://www.w3.org/2018/07/19-dxwgdcat-minutes#x09
See PR #295

@andrea-perego
Contributor

@dr-shorthair wrote:

@makxdekkers re the

CKAN extension – see, e.g.:
http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core

Could we get this example in DCAT? I can't find the API specification to pull it down.

Sorry, @dr-shorthair & @makxdekkers, for not replying earlier. Here's the relevant RDF (abridged):

<http://data.europa.eu/89h/jrc-predict-predict2017-core>
  a dcat:Dataset ;
  dcterms:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/IRREG> ;
  dcterms:description """PREDICT includes statistics on ICT industries and their R&D in Europe since 2006. [...]"""@en ;
  dcterms:identifier "jrc-predict-predict2017-core" ;
  dcterms:isReferencedBy <https://doi.org/10.2760/397817>, <https://doi.org/10.2760/63665> ;
  dcterms:issued "2017-05-10"^^xsd:date ;
  dcterms:language <http://publications.europa.eu/resource/authority/language/ENG> ;
  dcterms:modified "2017-05-10"^^xsd:date ;
  dcterms:publisher <http://publications.europa.eu/resource/authority/corporate-body/JRC> ;
  dcterms:relation [
    dcterms:description "PREDICT webpage (European Commission - JRC Science Hub)"@en ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/HTML> ;
    dcterms:title "Prospective insights on R&D in ICT (PREDICT)"@en ;
    dcat:accessURL <https://ec.europa.eu/jrc/en/predict>
  ] ;
  dcterms:spatial <http://publications.europa.eu/resource/authority/continent/AFRICA>, <http://publications.europa.eu/resource/authority/continent/AMERICA>, <http://publications.europa.eu/resource/authority/continent/ANTARCTICA>, <http://publications.europa.eu/resource/authority/continent/ASIA>, <http://publications.europa.eu/resource/authority/continent/EUROPE>, <http://publications.europa.eu/resource/authority/continent/OCEANIA> ;
  dcterms:subject <http://eurovoc.europa.eu/100146>, <http://eurovoc.europa.eu/100151> ;
  dcterms:temporal [
    a dcterms:PeriodOfTime ;
    schema:endDate "2016-12-31"^^xsd:date ;
    schema:startDate "1995-01-01"^^xsd:date
  ] ;
  dcterms:title "2017 PREDICT Dataset"@en ;
  dcat:contactPoint [
    a vcard:Kind ;
    vcard:hasEmail <mailto:montserrat.lopez-cobo@ec.europa.eu>
  ] ;
  dcat:distribution [
    a dcat:Distribution ;
    dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
    dcterms:description "The compressed zip file contains two Excel files splitting the complete 2017 PREDICT Dataset into: macroeconomic variables and R&D related variables."@en ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/XLS> ;
    dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
    dcterms:title "2017 PREDICT Dataset, Excel file"@en ;
    dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_xlsx.zip>
  ], [
    a dcat:Distribution ;
    dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
    dcterms:description "The compressed zip file contains a CSV file including the complete 2017 PREDICT Dataset"@en ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
    dcterms:title "2017 PREDICT Dataset, CSV file"@en ;
    dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_csv.zip>
  ] ;
  dcat:keyword "ICT R&D and innovation"@en, "ICT industry analysis"@en, "ICT"@en, "R&D"@en, "digital economy"@en, "information society"@en, "innovation"@en, "statistics"@en ;
  dcat:landingPage <https://ec.europa.eu/jrc/en/predict/ict-sector-analysis-2017/data-metadata> ;
  dcat:theme <http://publications.europa.eu/resource/authority/data-theme/ECON>, <http://publications.europa.eu/resource/authority/data-theme/TECH> .
