Script and instructions to load CORD-19 citation data into Neo4j #176
The script prepares the CSV files that the Cypher queries use to load the citation data into Neo4j. Originally used `py2neo` to load the articles directly from the script, but that was very slow.

Currently the only reliable citation data is the article titles. The script first creates nodes for the articles in the dataset, then merges cited articles, creating a new node only when the title has not been seen before. This leaves the possibility that distinct articles sharing the same title collapse into a single node. We could try to use the additional citation metadata to separate these, but that metadata is unreliable.
A quick script shows there are 59839 articles in total, and 57099 remain after removing duplicate titles.
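A rough version of that check can also be run inside Neo4j after the load, again assuming the `Article` label from the sketch above:

```cypher
// Compare total Article nodes with the number of distinct titles.
MATCH (a:Article)
RETURN count(a) AS articles, count(DISTINCT a.title) AS distinct_titles;
```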