Script and instructions to load CORD-19 citation data into Neo4j #176

Merged 4 commits on Jun 8, 2020


mrkarezina
Contributor

The script prepares the CSV files that the Cypher queries use to load the citation data into Neo4j.
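As a rough illustration of the CSV preparation step, here is a minimal sketch (not the script from this PR). The column names `cord_uid`, `title`, and `cited_title`, the input dict layout, and the function name `write_citation_csvs` are all assumptions made for the example.

```python
import csv
import io


def write_citation_csvs(articles, articles_out, citations_out):
    """Write one CSV of article nodes and one CSV of citation edges.

    `articles` is an iterable of dicts with the hypothetical keys
    'cord_uid', 'title', and 'cited_titles' (assumed layout).
    """
    node_writer = csv.writer(articles_out)
    edge_writer = csv.writer(citations_out)
    # Header rows so a CSV loader can refer to columns by name.
    node_writer.writerow(["cord_uid", "title"])
    edge_writer.writerow(["cord_uid", "cited_title"])
    for article in articles:
        node_writer.writerow([article["cord_uid"], article["title"]])
        # One edge row per cited title.
        for cited in article.get("cited_titles", []):
            edge_writer.writerow([article["cord_uid"], cited])


# Made-up sample data, written to in-memory buffers instead of files.
sample = [
    {"cord_uid": "a1", "title": "Paper A", "cited_titles": ["Paper B", "Paper C"]},
    {"cord_uid": "b2", "title": "Paper B", "cited_titles": []},
]
nodes_buf, edges_buf = io.StringIO(), io.StringIO()
write_citation_csvs(sample, nodes_buf, edges_buf)
print(nodes_buf.getvalue())
print(edges_buf.getvalue())
```

Keeping nodes and edges in separate files lets the articles be loaded first and the citations merged in a second pass.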

I originally used py2neo to load the articles directly from the script, but that was very slow.

Currently, the only reliable citation data is the title. The script first creates nodes for the articles in the dataset, then merges in the cited articles, creating a new node whenever a cited title has not been seen before. Because matching is by title alone, distinct articles that share a title can end up in the same node. We could try to use the other citation metadata to separate them, but that metadata is unreliable.
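The title-keyed merge described above can be simulated in plain Python. This is only a sketch of the behaviour, assuming exact string matching on titles. The function `build_title_graph`, the `in_dataset` flag, and the sample data are hypothetical and do not come from the PR.

```python
def build_title_graph(articles):
    """Simulate the two-pass, title-keyed load described above."""
    nodes = {}     # title -> node properties (one node per distinct title)
    edges = set()  # (citing_title, cited_title) pairs

    # Pass 1: one node per title for articles in the dataset.
    for article in articles:
        nodes.setdefault(article["title"], {"in_dataset": True})

    # Pass 2: merge cited titles, creating a node only for unseen titles.
    for article in articles:
        for cited in article.get("cited_titles", []):
            nodes.setdefault(cited, {"in_dataset": False})
            edges.add((article["title"], cited))
    return nodes, edges


# Made-up sample: two different articles both titled "Editorial".
articles = [
    {"title": "Coronavirus overview", "cited_titles": ["Editorial"]},
    {"title": "Editorial", "cited_titles": []},
    {"title": "Editorial", "cited_titles": ["Coronavirus overview"]},
]
nodes, edges = build_title_graph(articles)
print(len(nodes))  # 2: both "Editorial" articles share a single node
```

The sample shows the limitation: the two "Editorial" articles are indistinguishable once the title is the only key.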

A quick script shows there are 59839 articles in total, and 57099 after removing duplicate titles.
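A count like this can be obtained along the following lines. This is a sketch, not the script that produced the numbers above. It assumes titles are compared as exact strings, and `count_titles` and the sample list are made up for illustration.

```python
from collections import Counter


def count_titles(titles):
    """Return (total, unique, repeated), where `repeated` maps each
    title that occurs more than once to its number of occurrences."""
    counts = Counter(titles)
    repeated = {title: n for title, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), repeated


# Made-up sample titles.
titles = ["Paper A", "Editorial", "Paper B", "Editorial"]
total, unique, repeated = count_titles(titles)
print(total, unique, repeated)
```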

@mrkarezina changed the title from "Script and instructions to loading CORD-19 citation data into Neo4j" to "Script and instructions to load CORD-19 citation data into Neo4j" on Jun 8, 2020
@lintool self-requested a review on June 8, 2020, 23:15
@lintool merged commit 274d774 into castorini:master on Jun 8, 2020