1. Adapting the Metadata
The metadata determine the structure of your Linked Data file. You can change the outcome of CoW's conversion process by adapting the metadata.json file which the tool generates.
This section discusses the following components of the JSON schema file:
Each section includes an exercise for you to try and change a metadata file yourself.
The base URI determines what your URIs will start with. It is defined at the start of the metadata.json file. In the example below, the base URI is set to "https://iisg.amsterdam/". Thus all URIs in the Linked Data file will start with this URI.
{
"@context": [
"https://raw.githubusercontent.com/CLARIAH/COW/master/csvw.json",
{
"@language": "en",
"@base": "https://iisg.amsterdam/"
},
You can create your own base URI. For example:
{
"@context": [
"https://raw.githubusercontent.com/CLARIAH/COW/master/csvw.json",
{
"@language": "en",
"@base": "http://raw-data-now.org/trial-1/"
},
Changing the base URI is a great idea when you are trying things out. You can easily distinguish different versions by adding trial-1, trial-2, etcetera.
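If you prefer to script such a change instead of editing the file by hand, the idea can be sketched in Python. This is a minimal illustration, not part of CoW: the helper name set_base is our own, and the metadata dictionary only mirrors the "@context" fragment shown above.

```python
import json


def set_base(metadata, new_base):
    """Return a copy of the metadata with "@base" replaced (hypothetical helper)."""
    updated = json.loads(json.dumps(metadata))  # simple deep copy via JSON
    for entry in updated["@context"]:
        # "@context" mixes plain strings and objects; only objects can hold "@base".
        if isinstance(entry, dict) and "@base" in entry:
            entry["@base"] = new_base
    return updated


# Fragment of a metadata file, as shown in the example above.
metadata = {
    "@context": [
        "https://raw.githubusercontent.com/CLARIAH/COW/master/csvw.json",
        {"@language": "en", "@base": "https://iisg.amsterdam/"},
    ]
}

trial = set_base(metadata, "https://iisg.amsterdam/trial-1/")
print(trial["@context"][1]["@base"])
```

The original dictionary is left untouched, so you can derive trial-1, trial-2, and so on from the same starting point.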
Try changing the base URI yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. Add "trial-1/" to the base URI on line 6. Make sure to save the changes.
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. When you open this file and search for "/trial-1/" you should have 76 matches.
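The search can also be done with a few lines of Python. The sketch below is illustrative only: the sample text is a hypothetical two-line stand-in for the contents of an .nq file, and count_matches is a helper name of our own, not part of CoW.

```python
def count_matches(text, needle):
    """Count non-overlapping occurrences of needle in text."""
    return text.count(needle)


# Hypothetical sample; the real .nq file from the exercise is larger.
sample = (
    '<https://iisg.amsterdam/trial-1/0> '
    '<https://iisg.amsterdam/trial-1/vocab/Dienstboden> "1,5" .\n'
    '<https://iisg.amsterdam/trial-1/1> '
    '<https://iisg.amsterdam/trial-1/vocab/Dienstboden> "2,32" .\n'
)

print(count_matches(sample, "/trial-1/"))
```

To check your own result, read the contents of your .nq file into a string and pass it to the same function.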
Prefixes abbreviate URIs to save you the trouble of typing full URIs. The @context part of the JSON schema also contains prefixes as illustrated below.
{
"@context": [
"https://raw.githubusercontent.com/CLARIAH/COW/master/csvw.json",
{
"@language": "en",
"@base": "https://iisg.amsterdam/"
},
{
"aat": "http://vocab.getty.edu/aat/",
"bibo": "http://purl.org/ontology/bibo/",
...
"xsd": "http://www.w3.org/2001/XMLSchema#"
}
],
}
A number of prefixes are provided when building the JSON schema. In addition to the provided prefixes, you can create and add your own. In the example below we want to refer to the URI: "https://prefixes.causelesstypos.com/". Let's call the prefix "typos" and add it to the list of prefixes.
{
"@context": [
"https://raw.githubusercontent.com/CLARIAH/COW/master/csvw.json",
{
"@language": "en",
"@base": "https://iisg.amsterdam/"
},
{
...
"xsd": "http://www.w3.org/2001/XMLSchema#",
"typos": "https://prefixes.causelesstypos.com/"
}
],
}
Try changing the prefixes yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. After line 46 add the following line:
"lic": "http://opendefinition.org/licenses/"
Then change the license id to "lic:cc-by/". Make sure to save the changes.
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. When you open this file, you should find "http://opendefinition.org/licenses/cc-by/".
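What happens to a prefixed name such as "lic:cc-by/" can be sketched in Python. This is a simplified illustration under our own assumptions, not CoW's actual code: the helper name expand is hypothetical, and the prefix table contains only two of the entries discussed above.

```python
def expand(prefixed_name, prefixes):
    """Replace a known prefix by its full URI (hypothetical helper)."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local_name
    # Unknown prefix or no colon at all: leave the value unchanged.
    return prefixed_name


prefixes = {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "lic": "http://opendefinition.org/licenses/",
}

print(expand("lic:cc-by/", prefixes))
print(expand("xsd:float", prefixes))
```

The first call yields the license URI you should find in the .nq file of the exercise.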
When transforming data into Linked Data, it is important to define the data type of each column. Proper data type definitions result in more flexibility when you query the data later on.
Data types are described by the XML Schema, indicated by the xsd: prefix. Common datatypes are:
- xsd:string for text
- xsd:int for whole numbers that fit in 32 bits (-2147483648 to 2147483647)
- xsd:integer for any whole number
- xsd:float for numbers with decimals
- xsd:date for complete dates (YYYY-MM-DD)
- xsd:gYear for years
By default CoW adds the xsd: prefix to any data type. CoW also assigns the data type string to all columns. Below we change the data type of the number of maids by neighbourhood ('Dienstboden') to float:
{
"name": "Dienstboden",
"datatype": "float",
"@id": "https://iisg.amsterdam/buurt.csv/column/Dienstboden"
}
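The same change can be sketched in Python for cases where many columns need a new data type. This is a minimal illustration, assuming the columns live under "tableSchema" and "columns" as in the generated file; the helper name set_datatype is our own and not part of CoW.

```python
def set_datatype(metadata, column_name, datatype):
    """Set the datatype of one column in place; return True if the column was found."""
    for column in metadata["tableSchema"]["columns"]:
        if column.get("name") == column_name:
            column["datatype"] = datatype
            return True
    return False


# Reduced metadata with the two columns of the example table.
metadata = {
    "tableSchema": {
        "aboutUrl": "{_row}",
        "columns": [
            {"name": "properties_name_in_uri", "datatype": "string"},
            {"name": "Dienstboden", "datatype": "string"},
        ],
    }
}

found = set_datatype(metadata, "Dienstboden", "float")
print(found, metadata["tableSchema"]["columns"][1]["datatype"])
```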
Try changing the data types yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. Change the datatype of the column "Dienstboden" from "string" to "float". Make sure to save the changes.
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. When you open this file the second line should contain "1,5"^^<http://www.w3.org/2001/XMLSchema#float> where #float refers to the correct data type.
The example CSV file used in the exercises contains the following table:
properties_name_in_uri | Dienstboden |
---|---|
buurt-a | 1,5 |
buurt-b | 2,32 |
buurt-c | 1,96 |
buurt-d | 1,37 |
This table has two columns and four rows. The first column is called "properties_name_in_uri" and the second column is called "Dienstboden". The data are about neighbourhoods (buurt) and the number of maids (Dienstboden) living there.
The JSON schema file represents a column as follows:
{
"name": "properties_name_in_uri",
"datatype": "string",
"@id": "https://iisg.amsterdam/buurt.csv/column/properties_name_in_uri",
"dc:description": "properties_name_in_uri",
"titles": [
"properties_name_in_uri"
]
},
The column "properties_name_in_uri" is referred to by the URI "https://iisg.amsterdam/buurt.csv/column/properties_name_in_uri" (@id). Since the values of this column are text, the data type is "string". The description, name, and title of the column are all "properties_name_in_uri". The name refers to the name of the column in the original CSV file.
The description and title can be more specific. In this way, Linked Data adds information to the data itself.
Try changing the column metadata yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. Improve the description and title of the first column as follows:
{
"name": "properties_name_in_uri",
"datatype": "string",
"@id": "https://iisg.amsterdam/buurt.csv/column/properties_name_in_uri",
"dc:description": "Name of neighbourhood as described in the dataset",
"titles": ["Property name of neighbourhood in the URI"]
},
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. The triple on line 47 should now contain <http://purl.org/dc/terms/description> "Name of neighbourhood as described in the dataset"@en.
Linked Data consists of triples. A triple contains a subject, a predicate, and an object (see also the introduction). These elements are defined in the metadata JSON schema by aboutUrl, propertyUrl, and valueUrl or csvw:value respectively.
Another notation for triples is ?s ?p ?o. In SPARQL queries ?s stands for subject, ?p for predicate, and ?o for object. For clarity, we've added this notation to the examples. The notation will not show in your example files.
The default JSON schema from our example table describes the column "Dienstboden" (maids) as:
{
"@id": "https://iisg.amsterdam/buurt.csv/column/Dienstboden",
"datatype": "string",
"dc:description": "Dienstboden",
"name": "Dienstboden",
"titles": [
"Dienstboden"
]
},
The results of the JSON schema are triples like:
<https://iisg.amsterdam/26> <https://iisg.amsterdam/vocab/Dienstboden> "1,31"^^<http://www.w3.org/2001/XMLSchema#string>
?s ?p ?o
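To make the three positions concrete, the example triple can be split into its parts with Python. This is a deliberately naive sketch: splitting on spaces only works here because the literal contains no spaces, and a real RDF parser would be needed for general .nq files.

```python
# The example triple from above, written as one line of text.
line = (
    '<https://iisg.amsterdam/26> '
    '<https://iisg.amsterdam/vocab/Dienstboden> '
    '"1,31"^^<http://www.w3.org/2001/XMLSchema#string>'
)

# Split at the first two spaces: subject (?s), predicate (?p), object (?o).
subject, predicate, obj = line.split(" ", 2)

print("?s", subject)
print("?p", predicate)
print("?o", obj)
```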
The subject (?s) consists of the base URI (https://iisg.amsterdam/) and the row number. The first row of the CSV file defaults to 0. The row number is appended to the base URI by default in this part of the JSON schema:
"tableSchema": {
"aboutUrl": "{_row}",
"columns": [
The aboutUrl consists of:
- the base URI: defined with "@base": "https://iisg.amsterdam/",
- and the row number: defined with "aboutUrl": "{_row}", under "tableSchema"
The code to add the row number to the aboutUrl, {_row}, builds on Jinja. The {} indicate that CoW needs to execute Jinja. The Jinja code _row adds the row number to the aboutUrl.
<https://iisg.amsterdam/0> <https://iisg.amsterdam/vocab/properties_name_in_uri> "buurt-a"^^<http://www.w3.org/2001/XMLSchema#string>
?s ?p ?o
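How a subject URI follows from the base URI and the aboutUrl can be imitated in Python. The sketch below is an approximation under our own assumptions: it uses str.format and urljoin from the standard library rather than CoW's own template handling, and the function name subject_uri is hypothetical.

```python
from urllib.parse import urljoin


def subject_uri(base, about_url, row_number, row):
    """Fill in the aboutUrl template and resolve it against the base URI."""
    values = dict(row, _row=row_number)  # column values plus the row number
    return urljoin(base, about_url.format(**values))


base = "https://iisg.amsterdam/"
first_row = {"properties_name_in_uri": "buurt-a", "Dienstboden": "1,5"}

# Default template: the subject is the base URI plus the row number.
print(subject_uri(base, "{_row}", 0, first_row))

# A template that uses a column value instead of the row number.
print(subject_uri(base, "buurt.csv/{properties_name_in_uri}", 0, first_row))
```

The second call shows the trade-off discussed in this section: the URI gains meaning, but no longer records which row it came from.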
IMPORTANT NOTE
Remember to change the overall aboutUrl in the tableSchema (see the exercise below). At the very least, extend the URI. Make sure to add something unique.
Other unique values in the .csv files themselves (e.g. an id) can also be added to the aboutUrl. The advantage of using values from the .csv files is the addition of meaning to the URI. The disadvantage of this approach is the loss of information on the row the information was taken from.
Choose between a substantive URI or a provenance related URI based on the project. For a substantive URI, add unique values from the .csv file to the aboutUrl. For provenance information, keep the row number in the aboutUrl.
Try changing the aboutUrl or subject yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. First, add your project name to the aboutUrl to ensure unique subject URIs:
"tableSchema": {
"aboutUrl": "buurt.csv/{_row}",
Second, remove the primaryKey and add the unique names of neighbourhoods to the aboutUrl:
"tableSchema": {
"aboutUrl": "buurt.csv/{properties_name_in_uri}",
"columns": [
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. When you open this file, the subject of a triple should have changed to:
<https://iisg.amsterdam/buurt.csv/buurt-a>
?s
The predicate is the second element of a triple. A predicate can add information about the subject. To establish that the neighbourhoods ("Buurt") in the example reflect geographical areas, the JSON schema is adapted as follows:
{
"name": "properties_name_in_uri",
"datatype": "string",
"dc:description": "Name of neighbourhood as described in the dataset",
"titles": ["Property name of neighbourhood in the URI"],
"propertyUrl": "rdf:type",
"valueUrl": "sdmx-dimension:refArea",
"@id": "https://iisg.amsterdam/buurt.csv/column/properties_name_in_uri"
},
The results of the JSON schema are triples like:
<https://iisg.amsterdam/buurt.csv/buurt-a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/sdmx/2009/dimension#refArea>
?s ?p ?o
Note: Check whether the vocabulary is added to the list of prefixes. Otherwise, add "sdmx-dimension": "http://purl.org/linked-data/sdmx/2009/dimension#", to the list.
The predicate (?p) is determined by the propertyUrl and refers to an existing vocabulary (RDF). Using existing vocabularies makes it easier to exchange data.
Note: The example above does not specify a subject. Instead, the subject (?s) is determined by the global aboutUrl set in the tableSchema. To specify a different subject, create a virtual column.
When there is no suitable predicate in an existing vocabulary, you can create your own. Defining "propertyUrl": "vocab/averageNrMaids" results in triples like:
<https://iisg.amsterdam/buurt.csv/buurt-a> <https://iisg.amsterdam/vocab/averageNrMaids> "1,5"^^<http://www.w3.org/2001/XMLSchema#string>
?s ?p ?o
The predicate (?p) consists of:
- the base URI: defined with "@base": "https://iisg.amsterdam/",
- and the propertyUrl: defined with "propertyUrl": "vocab/averageNrMaids"
The word "vocab" clarifies that "averageNrMaids" is a term from our personal vocabulary. For a dataset specific vocabulary, precede the propertyUrl with the name of your dataset: "buurt.csv/vocab/averageNrMaids".
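The two ways of writing a propertyUrl, a prefixed name from an existing vocabulary or a path relative to the base URI, can be contrasted in a short Python sketch. It is a simplified model under our own assumptions, not CoW's implementation; the function name resolve is hypothetical and the prefix table holds only the rdf entry used in this section.

```python
from urllib.parse import urljoin


def resolve(value, base, prefixes):
    """Expand a known prefix, otherwise resolve the value against the base URI."""
    prefix, separator, local_name = value.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local_name
    return urljoin(base, value)


base = "https://iisg.amsterdam/"
prefixes = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

print(resolve("rdf:type", base, prefixes))
print(resolve("vocab/averageNrMaids", base, prefixes))
print(resolve("buurt.csv/vocab/averageNrMaids", base, prefixes))
```

The prefix check comes first on purpose: a value such as "rdf:type" would otherwise be mistaken for an absolute URI with the scheme "rdf".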
Try changing the propertyUrl or predicate yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
Note: This exercise builds upon the previous exercise (see subject exercise).
- Open and edit the metadata file. First, add the propertyUrl and valueUrl to specify neighbourhoods as geographical areas:
{
"name": "properties_name_in_uri",
...,
"propertyUrl": "rdf:type",
"valueUrl": "sdmx-dimension:refArea",
"@id": "https://iisg.amsterdam/buurt.csv/column/properties_name_in_uri"
},
Second, create your own vocabulary for the "Dienstboden" (maids) column:
{
"name": "Dienstboden",
"datatype": "string",
"dc:description": "Dienstboden, presumably average number per household",
"titles": ["Dienstboden"],
"propertyUrl": "vocab/averageNrMaids",
"@id": "https://iisg.amsterdam/buurt.csv/column/Dienstboden"
}
Note: remember to change the data type to float as well (see data types exercise).
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. The triples on lines 2, 5, 7 and 8 should now contain <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/sdmx/2009/dimension#refArea>.
The object is the third element of a triple. An object can be a URI (valueUrl) or a value (csvw:value). When changing the predicate of the neighbourhoods ("Buurt") to reflect geographical areas, the object was set to a URI with "valueUrl": "sdmx-dimension:refArea".
The next example specifies the average number of maids as a value (csvw:value). Because the example data follow Dutch conventions, the average number of maids uses a comma as decimal separator. To change the decimal separator to a point, the JSON schema is adapted as follows:
{
"name": "Dienstboden",
"datatype": "float",
"dc:description": "Dienstboden",
"titles": ["Dienstboden"],
"propertyUrl": "vocab/averageNrMaids",
"csvw:value": "{{Dienstboden|replace(',', '.')}}",
"@id": "https://iisg.amsterdam/buurt.csv/column/Dienstboden"
}
The results of the JSON schema are triples like:
<https://iisg.amsterdam/buurt.csv/buurt-a> <https://iisg.amsterdam/vocab/averageNrMaids> "1.5"^^<http://www.w3.org/2001/XMLSchema#float>
?s ?p ?o
The object (?o) is determined by csvw:value; its data type (float) refers to an existing vocabulary (XML Schema).
The code to replace the decimal separator, {{Dienstboden|replace(',', '.')}}, builds on Jinja. The {{}} indicate that CoW needs to execute Jinja. The Jinja code replace(',', '.') replaces the decimal separator , with . for the column Dienstboden.
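Why the replacement matters can be shown with plain Python, whose str.replace behaves like the replace step described above for this case. The sketch only illustrates the idea; it does not run Jinja or CoW, and the helper name to_decimal_point is our own.

```python
def to_decimal_point(raw):
    """Swap the decimal comma for a decimal point."""
    return raw.replace(",", ".")


# Values of the "Dienstboden" column in the example table.
values = ["1,5", "2,32", "1,96", "1,37"]

# With a decimal comma the text cannot be read as a number.
try:
    float(values[0])
    comma_is_valid = True
except ValueError:
    comma_is_valid = False

converted = [float(to_decimal_point(value)) for value in values]
print(comma_is_valid, converted)
```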
Remember this important distinction: to create a URI use valueUrl, to create a value such as a number or a string use csvw:value.
Try changing the object yourself. Download the example CSV file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv file. Next, follow these steps:
- Build the metadata file for the example CSV file:
cow_tool build buurt.csv
The tool generates a -metadata.json file in the folder where you saved the example file.
- Open and edit the metadata file. Replace the decimal separator of the "Dienstboden" (maids) column when adding the csvw:value:
{
"name": "Dienstboden",
...,
"propertyUrl": "vocab/averageNrMaids",
"csvw:value": "{{Dienstboden|replace(',', '.')}}",
"@id": "https://iisg.amsterdam/buurt.csv/column/Dienstboden"
}
Note: remember to change the data type to float as well (see data types exercise).
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq file. The triples on lines 1, 4, 6 and 7 should now contain the respective value, for example "1.5"^^<http://www.w3.org/2001/XMLSchema#float>.
Note: with valueUrl you can change an object to a URI (see predicate exercise).
CoW is built on QBer, part of the datalegend ecosystem for historical statistics. For more tools, datasets and infrastructure please visit https://datalegend.net. datalegend is a work package within CLARIAH, a back-to-back NWO funded large research facility (#184.033.101).