🧬 💿 Added the biokg dataset from Walsh et al 2020 #585
@PyKEEN-bot trigger CI |
@PyKEEN-bot test please |
Thanks @sbonner0. I tried it with the SingleTabbedDataset and it didn't work, so I did a bit of class hierarchy magic so that the code from the tar-file single tabbed dataset loader also works on zip archives. Note that you can actually load the file without extracting the zip archive.
Once we figure out what's going on with our weird CI setup, we'll merge this one.
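For anyone curious what reading a tab-separated file straight out of a zip archive can look like, here is a minimal sketch using only the Python standard library. It is not the loader code from this PR: the function name `read_triples_from_zip` and the inner file name `triples.tsv` are hypothetical, and the demo builds a tiny in-memory archive rather than touching the real BioKG download.

```python
import csv
import io
import zipfile


def read_triples_from_zip(archive, inner_path, delimiter="\t"):
    """Read (head, relation, tail) rows from a delimited file inside a zip.

    `archive` may be a filesystem path or a binary file-like object.
    Nothing is extracted to disk; the member is streamed from the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(inner_path) as raw:
            # zf.open() yields bytes, so wrap it to get decoded text lines
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return [tuple(row) for row in csv.reader(text, delimiter=delimiter) if row]


# Demo: build a small zip in memory (hypothetical member name "triples.tsv")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr(
        "triples.tsv",
        "A0A075B6P5\tPROTEIN_PATHWAY_ASSOCIATION\tR-HSA-983695\n"
        "A0A087X1C5\tPROTEIN_PATHWAY_ASSOCIATION\thsa04726\n",
    )
buffer.seek(0)

triples = read_triples_from_zip(buffer, "triples.tsv")
for head, relation, tail in triples:
    print(head, relation, tail)
```

Because `zipfile.ZipFile` accepts both paths and file-like objects, the same function would work on a downloaded archive on disk, assuming you know the name of the member file inside it.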
@sbonner0 before we merge you have to assign two emoji to the front of this PR's name |
Thanks for your help with this @cthoyt ! |
There seems to be a slight mistake in the number of triples of this dataset in the documentation. The number of triples should be 2067998 instead of 105524 (which is the number of entities). |
@Rodrigo-A-Pereira thanks for pointing that out. The summary should be as follows:

$ python -m pykeen.datasets.biokg -vv
2021-09-08 12:32:23 INFO done splitting triples to groups of sizes [1552051, 206800, 206800]
2021-09-08 12:32:23 INFO [BioKG] done splitting data from /Users/cthoyt/.data/pykeen/datasets/biokg/biokg.zip
BioKG (create_inverse_triples=False)
Name        Entities    Relations    Triples
----------  ----------  -----------  ---------
Training    105524      17           1654397
Testing     105524      17           206800
Validation  105524      17           206800
Total       -           -            2067997

Head        Relation                     tail
----------  ---------------------------  -------------
A0A075B6P5  PROTEIN_PATHWAY_ASSOCIATION  R-HSA-983695
A0A075B6S6  PROTEIN_PATHWAY_ASSOCIATION  R-HSA-983695
A0A078BQP2  PROTEIN_PATHWAY_ASSOCIATION  R-CEL-2514859
A0A087WPF7  PROTEIN_PATHWAY_ASSOCIATION  R-MMU-8939243
A0A087X1C5  PROTEIN_PATHWAY_ASSOCIATION  hsa04726 |
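As a quick sanity check on the summary above, the three split sizes can be added up and compared with the reported total. This is only arithmetic on the numbers printed in the table; it does not load the dataset, and the variable names are illustrative.

```python
# Split sizes copied from the "Triples" column of the summary table
splits = {"Training": 1654397, "Testing": 206800, "Validation": 206800}
reported_total = 2067997

total = sum(splits.values())
print(total == reported_total)

# Share of each split, rounded for readability
shares = {name: round(count / total, 3) for name, count in splits.items()}
print(shares)
```

The splits do add up to the reported total of 2067997, which is one less than the 2067998 quoted earlier in this thread; the output does not explain that difference.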
This adds the BioKG dataset from https://dl.acm.org/doi/abs/10.1145/3340531.3412776