This is a suite of scripts and models that refactor the metadata set of the Project Gutenberg digital library, with the aim of turning it into a proper Linked Data set. The refactoring currently covers:
- Reconciliation of blank nodes, resulting in a much smaller dataset (~29% smaller as of March 2020)
- Linking with Library of Congress subject headings and classification systems
- Structuring of table-of-contents data
- Ontology alignment of undocumented Gutenberg terms
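
As an illustration of the first point, blank-node reconciliation can be pictured as merging blank nodes that carry identical descriptions. The snippet below is a minimal sketch of that idea on plain Python tuples and is not necessarily how the scripts in this repository do it. The `_:` prefix convention, the function name `reconcile_blank_nodes`, and the sample triples are assumptions made for illustration only.

```python
from collections import defaultdict


def reconcile_blank_nodes(triples):
    """Merge blank nodes (terms starting with '_:') whose predicate/object
    sets are identical, and rewrite every reference to them.

    Simplification: descriptions are compared one level deep only, so
    blank nodes that describe other blank nodes are not merged recursively.
    """
    # Collect the description (predicate/object pairs) of each blank node.
    descriptions = defaultdict(set)
    for s, p, o in triples:
        if s.startswith("_:"):
            descriptions[s].add((p, o))

    # Pick one canonical blank node per distinct description.
    canonical = {}
    replacement = {}
    for node in sorted(descriptions):
        key = frozenset(descriptions[node])
        replacement[node] = canonical.setdefault(key, node)

    # Rewrite subjects and objects; the set drops the now-duplicate triples.
    return {
        (replacement.get(s, s), p, replacement.get(o, o))
        for s, p, o in triples
    }


# Hypothetical sample data: two ebooks pointing at equivalent blank nodes.
triples = [
    ("ebook:1", "dcterms:subject", "_:b1"),
    ("_:b1", "rdf:value", "Fiction"),
    ("ebook:2", "dcterms:subject", "_:b2"),
    ("_:b2", "rdf:value", "Fiction"),
]
merged = reconcile_blank_nodes(triples)
print(len(triples), "->", len(merged))  # 4 -> 3
```

Both ebooks end up pointing at the same node, which is where the reduction in dataset size comes from.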
You need:
- Python 3
- an RDF store with SPARQL querying/updating over HTTP (e.g. Jena Fuseki, Virtuoso, BlazeGraph)
- the Project Gutenberg catalog as RDF (download it from https://www.gutenberg.org/wiki/Gutenberg:Feeds)
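
To check that your store meets the "SPARQL over HTTP" requirement, something like the following sketch may help. It builds a SPARQL 1.1 Protocol GET request with an `ASK` query using only the Python standard library. The endpoint URL, the graph URI, and the function names are placeholders for illustration, not values taken from this project; substitute those of your own store.

```python
import json
import urllib.parse
import urllib.request


def build_ask_request(endpoint, graph_uri):
    """Build a GET request asking whether the named graph holds any triple."""
    query = "ASK { GRAPH <%s> { ?s ?p ?o } }" % graph_uri
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def graph_has_data(endpoint, graph_uri, timeout=10):
    """Send the ASK query and return the boolean answer from the JSON result."""
    request = build_ask_request(endpoint, graph_uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["boolean"]


# Example call (placeholder endpoint and graph URI, requires a running store):
# graph_has_data("http://localhost:3030/gutenberg/query",
#                "http://example.org/graph/gutenberg")
```

A `True` result after loading the catalog suggests that the endpoint and graph name are the ones to configure for the refactoring scripts.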
- Download the metadata set from Gutenberg and load it into your RDF store.
- `cd gutenberg-fixes`
- In `settings.py`, set the SPARQL service and RDF graph name.
- Run `python refactor.py bookshelves formats toc` (or any subset of the three arguments).
- In `gutenberg-fixes/queries` you can find other SPARQL queries to run yourself.
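
If you want to run one of the files in `gutenberg-fixes/queries` by hand, a sketch along these lines could work. It assumes the file contains a SPARQL Update request and that your store accepts direct POSTs with the `application/sparql-update` content type, as defined by the SPARQL 1.1 Protocol. The function names, the endpoint URL, and the file name are hypothetical; this is not the mechanism used by `refactor.py`.

```python
import urllib.request
from pathlib import Path


def build_update_request(update_endpoint, update_text):
    """Build a SPARQL 1.1 Protocol 'update via POST directly' request."""
    return urllib.request.Request(
        update_endpoint,
        data=update_text.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update; charset=utf-8"},
        method="POST",
    )


def run_update_file(update_endpoint, path, timeout=60):
    """Read a SPARQL Update file and send it; return the HTTP status code."""
    update_text = Path(path).read_text(encoding="utf-8")
    request = build_update_request(update_endpoint, update_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (placeholder endpoint and file name, requires a running store):
# run_update_file("http://localhost:3030/gutenberg/update", "queries/example.rq")
```

Files that contain `SELECT` or `CONSTRUCT` queries rather than updates should go to the store's query endpoint instead.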
Gutenberg-LD is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.