Best practice for a loosely-structured catalog #253
I should have written
though in practice they usually are files in repositories. In strict DCAT terms
But this is all idealized. The point is that most repositories do not require the depositor to make such distinctions, and as long as manually-completed forms are involved there will be resistance or non-compliance from the kind of data depositors that I have in mind (researchers). There might be some heuristics that could be applied, and future automation will help. But my proposal is that with the addition of just one axiom we might accommodate the present reality in a way that improves on current habits - where in the absence of something better, everything in a bag of files is often linked to the dataset using |
As this discussion moved from the mailing list to this issue, for completeness I'm adding the other messages from the mailing list in this thread.
and @andrea-perego said:
|
@dr-shorthair I see. After giving it some thought, I also quite like the idea of a Still, my main concern is that accommodating this kind of loose description adds complexity to consumers of such data (both people and applications such as data catalogs) in the sense that some DCAT records will be described only by a In the end, it all comes down to whether we should accommodate existing behavior where datasets are clearly not described well enough (for various reasons), or encourage describing them properly. Maybe this could be done by at least strongly recommending to stick to the Dataset -> Distribution -> File or Dataset -> Data Distribution Service pattern. |
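For readers unfamiliar with the Dataset -> Distribution -> File pattern recommended above, here is a minimal Turtle sketch. The `ex:` namespace, the titles and the file URL are hypothetical placeholders and do not come from any catalog discussed in this thread.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <https://example.org/> .   # hypothetical namespace, for illustration only

# The dataset is the abstract resource being described.
ex:dataset-1
    a dcat:Dataset ;
    dcterms:title "Example dataset"@en ;
    dcat:distribution ex:dataset-1-csv .

# The distribution describes one available form of the dataset
# and points at the actual file.
ex:dataset-1-csv
    a dcat:Distribution ;
    dcterms:title "Example dataset, CSV file"@en ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
    dcat:downloadURL <https://example.org/files/dataset-1.csv> .
```

Both resources have their own IRI here, so either can be linked to from elsewhere.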
This discussion relates to proposed use-case ID53 - #256 |
Following up on ACTION assigned in this week's DCAT meeting. Example 1 - undifferentiated set of files each of which is linked to the
Example 2 - The same dataset with the 'files' linked using more precise semantics - four of the files are representations of the data, one is a copy of the source data, one is a zip archive containing the schema/ontology definitions:
|
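The original listing for Example 2 is not reproduced here. As a rough illustration only of what "more precise semantics" could look like, the sketch below links representations with `dcat:distribution`, the copy of the source data with `dcterms:source`, and the archive of schema definitions with `dcterms:relation`. These property choices, the `ex:` namespace and all file URLs are assumptions made for the sketch, not the properties used in the original example, and only two of the four representations are shown.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <https://example.org/> .   # hypothetical namespace, for illustration only

ex:dataset-2
    a dcat:Dataset ;
    # Files that are representations of the data itself.
    dcat:distribution ex:dataset-2-csv, ex:dataset-2-json ;
    # A copy of the data this dataset was derived from (assumed property choice).
    dcterms:source <https://example.org/files/source-data.csv> ;
    # An archive of schema definitions, linked as a generic related resource
    # (assumed property choice).
    dcterms:relation <https://example.org/files/schema.zip> .

ex:dataset-2-csv
    a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/dataset-2.csv> .

ex:dataset-2-json
    a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/dataset-2.json> .
```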
... and this example is where the set of files are actually representations of parts of the dataset:
|
@dr-shorthair, some questions:
|
@makxdekkers, some responses:
The landing page URLs do work, so you can inspect the sources for yourself. |
@makxdekkers re the
Could we get this example in DCAT? I can't find the API specification to pull it down. |
@dr-shorthair We'll need to ask @andrea-perego. I have no access to the back-end of the JRC catalogue. |
Thanks @dr-shorthair. Here is an example where distributions were used for a case of multiple files, as there was no other way of representing this. The example, as provided by the catalogue, is actually in schema.org, but there is pretty much a 1-to-1 mapping.
Here goes an attempt to use
So, my questions/comments would be:
|
@agbeltran Thanks for the example. This is exactly what I would not like to be allowed or encouraged by DCAT, as what you describe can be perfectly well represented as 3 datasets (each with a different temporal coverage and one distribution), and after the DCAT revision, hopefully, using a 4th dataset having these 3 as parts (i.e. dataset series). The issues that you describe, i.e. properties having As I stated earlier, I do not see the value of allowing representation of "just a bag of files" and I would rather encourage publishers to describe the files properly rather than creating messy DCAT data. @dr-shorthair Regarding your usage of blank nodes, coming from the Linked Data community, I would discourage their usage. Simply everything should have an IRI, according to the basic Linked Data principles. No one can anticipate that there will be no interest to link to, e.g. parts of datasets (or datasets in a dataset series, which I think is the same thing). Furthermore, I would object to stating that dataset parts should inherit some properties from their parent dataset, as again this is messier to consume. |
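The alternative described in the comment above (three datasets, each with its own temporal coverage and one distribution, grouped by a fourth dataset) could be sketched roughly as follows. `dcterms:hasPart` is used for the grouping as an assumption, since the DCAT revision had not yet settled how dataset series are expressed. The `ex:` namespace, the years and the file URLs are hypothetical. Every resource has an IRI, in line with the remark about blank nodes.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <https://example.org/> .   # hypothetical namespace, for illustration only

# Fourth dataset grouping the three parts (assumed modelling with dcterms:hasPart).
ex:series
    a dcat:Dataset ;
    dcterms:hasPart ex:part-2015, ex:part-2016, ex:part-2017 .

# Each part is a dataset with its own temporal coverage and one distribution.
ex:part-2015 a dcat:Dataset ;
    dcterms:temporal ex:period-2015 ;
    dcat:distribution ex:part-2015-csv .
ex:part-2016 a dcat:Dataset ;
    dcterms:temporal ex:period-2016 ;
    dcat:distribution ex:part-2016-csv .
ex:part-2017 a dcat:Dataset ;
    dcterms:temporal ex:period-2017 ;
    dcat:distribution ex:part-2017-csv .

ex:period-2015 a dcterms:PeriodOfTime ;
    schema:startDate "2015-01-01"^^xsd:date ;
    schema:endDate   "2015-12-31"^^xsd:date .
ex:period-2016 a dcterms:PeriodOfTime ;
    schema:startDate "2016-01-01"^^xsd:date ;
    schema:endDate   "2016-12-31"^^xsd:date .
ex:period-2017 a dcterms:PeriodOfTime ;
    schema:startDate "2017-01-01"^^xsd:date ;
    schema:endDate   "2017-12-31"^^xsd:date .

ex:part-2015-csv a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/2015.csv> .
ex:part-2016-csv a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/2016.csv> .
ex:part-2017-csv a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/2017.csv> .
```

With this shape, temporal coverage is stated once per part rather than on a shared parent, so a consumer does not need to infer which file a given property applies to.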
@jakubklimek I understand your concern about the blank nodes. In this issue I was tackling a separate question: the lack of guidance on how to represent the information in many existing catalogs, and the consequent mis-use of the Best practice would certainly be to identify and describe them in their own right. However, as we have no more information available in the catalog that I was quoting from, I was just making sure that the model was correct first. We have already heard that existing catalogs commonly use blank nodes for |
@agbeltran Great. The next step could be to use
|
@dr-shorthair Thanks for the clarification, now I think we are on the same page. Regarding blank nodes, I created #300. +1 on the usage of |
The global domain constraints on Is this OK? |
@dr-shorthair This is interesting, and I think it is not OK. This leads to the question of whether Next question is, whether your referenced files are distributions of another dataset and if so, which one? But then Or, the example needs to be expanded, and these relations would connect to a dataset, which would have to have a distribution, like this:
<dataset> dcterms:references [ a dcat:Dataset ;
    dcat:distribution [
      dcterms:identifier "timescale.zip" ;
      dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
      dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip>
    ]
] ;
Regarding the |
I agree with @jakubklimek that it feels wrong. It seems to me that a Distribution is supposed to distribute something. The definition of Distribution says it "Represents a specific available form of a dataset", so there must be a connection to a Dataset, and that connection is modelled using
|
OK - I've updated the example above to interpose a In the case of the For the time being, I've added an additional
In the case of the
Since the latter case refers to an OWL ontology serialized in Turtle packaged in a zip archive, it will need updating when we resolve #259. |
Resolved in https://www.w3.org/2018/07/19-dxwgdcat-minutes#x09 |
@dr-shorthair wrote:
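Sorry, @dr-shorthair & @makxdekkers, for not replying earlier. Here's the relevant RDF (abridged):
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix vcard:   <http://www.w3.org/2006/vcard/ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<http://data.europa.eu/89h/jrc-predict-predict2017-core>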
a dcat:Dataset ;
dcterms:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/IRREG> ;
dcterms:description """PREDICT includes statistics on ICT industries and their R&D in Europe since 2006. [...]"""@en ;
dcterms:identifier "jrc-predict-predict2017-core" ;
dcterms:isReferencedBy <https://doi.org/10.2760/397817>, <https://doi.org/10.2760/63665> ;
dcterms:issued "2017-05-10"^^xsd:date ;
dcterms:language <http://publications.europa.eu/resource/authority/language/ENG> ;
dcterms:modified "2017-05-10"^^xsd:date ;
dcterms:publisher <http://publications.europa.eu/resource/authority/corporate-body/JRC> ;
dcterms:relation [
dcterms:description "PREDICT webpage (European Commission - JRC Science Hub)"@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/HTML> ;
dcterms:title "Prospective insights on R&D in ICT (PREDICT)"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/en/predict>
] ;
dcterms:spatial <http://publications.europa.eu/resource/authority/continent/AFRICA>, <http://publications.europa.eu/resource/authority/continent/AMERICA>, <http://publications.europa.eu/resource/authority/continent/ANTARCTICA>, <http://publications.europa.eu/resource/authority/continent/ASIA>, <http://publications.europa.eu/resource/authority/continent/EUROPE>, <http://publications.europa.eu/resource/authority/continent/OCEANIA> ;
dcterms:subject <http://eurovoc.europa.eu/100146>, <http://eurovoc.europa.eu/100151> ;
dcterms:temporal [
a dcterms:PeriodOfTime ;
schema:endDate "2016-12-31"^^xsd:date ;
schema:startDate "1995-01-01"^^xsd:date
] ;
dcterms:title "2017 PREDICT Dataset"@en ;
dcat:contactPoint [
a vcard:Kind ;
vcard:hasEmail <mailto:montserrat.lopez-cobo@ec.europa.eu>
] ;
dcat:distribution [
a dcat:Distribution ;
dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
dcterms:description "The compressed zip file contains two Excel files splitting the complete 2017 PREDICT Dataset into: macroeconomic variables and R&D related variables."@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/XLS> ;
dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
dcterms:title "2017 PREDICT Dataset, Excel file"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_xlsx.zip>
], [
a dcat:Distribution ;
dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
dcterms:description "The compressed zip file contains a CSV file including the complete 2017 PREDICT Dataset"@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
dcterms:title "2017 PREDICT Dataset, CSV file"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_csv.zip>
] ;
dcat:keyword "ICT R&D and innovation"@en, "ICT industry analysis"@en, "ICT"@en, "R&D"@en, "digital economy"@en, "information society"@en, "innovation"@en, "statistics"@en ;
dcat:landingPage <https://ec.europa.eu/jrc/en/predict/ict-sector-analysis-2017/data-metadata> ;
dcat:theme <http://publications.europa.eu/resource/authority/data-theme/ECON>, <http://publications.europa.eu/resource/authority/data-theme/TECH> . |
@dr-shorthair raised this in the mailing list:
I had a few concerns regarding this proposal:
dcat:downloadURL, I would disagree with the possibility to allow linking them directly from a dcat:Dataset record, as this would create mess everywhere where a publisher would be a bit lazy to describe the data properly.
dcat:distribution in a wrong way mainly due to the lack of support for dataset series, which is being resolved in this DCAT revision. When this support is added, publishers will have the possibility of modeling many use cases correctly.