🧬 💿 Added the biokg dataset from Walsh et al 2020 #585

Merged: 5 commits merged on Sep 1, 2021

sbonner0
Contributor

@sbonner0 sbonner0 commented Aug 30, 2021

This adds the BioKG dataset from https://dl.acm.org/doi/abs/10.1145/3340531.3412776

src/pykeen/datasets/base.py (outdated review thread, resolved)
@cthoyt
Member

cthoyt commented Aug 31, 2021

@PyKEEN-bot trigger CI

@cthoyt
Member

cthoyt commented Aug 31, 2021

@PyKEEN-bot test please
i am tired sorry for double post

Member

@cthoyt cthoyt left a comment

Thanks @sbonner0. I tried it using the SingleTabbedDataset and it didn't work, so I did a bit of class hierarchy magic so that the code from the tar file single tabbed dataset loader can also be reused for zips. Note that you can actually get away with loading the file without extracting the zip archive.
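
For readers wondering what loading "without extracting the zip archive" can look like, here is a minimal standard-library sketch. It is not the loader added in this PR: the helper name `read_tsv_from_zip` and the member name `toy.tsv` are hypothetical, and the toy archive is built in memory with one row copied from the sample triples in this thread.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_source, member_name):
    """Read a tab-separated member of a zip archive without extracting it to disk.

    `zip_source` may be a path or a binary file-like object.
    Hypothetical helper for illustration only, not part of PyKEEN.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # ZipFile.open streams the member's decompressed bytes directly.
        with archive.open(member_name) as binary_file:
            text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
            return [tuple(row) for row in csv.reader(text_file, delimiter="\t")]


# Build a tiny in-memory archive so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "toy.tsv",  # hypothetical member name
        "A0A075B6P5\tPROTEIN_PATHWAY_ASSOCIATION\tR-HSA-983695\n",
    )
buffer.seek(0)

rows = read_tsv_from_zip(buffer, "toy.tsv")
print(rows)
```

The same idea should carry over to a real archive on disk by passing its path instead of the in-memory buffer, assuming you know the name of the member that holds the triples.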

Once we figure out what's going on with our weird CI setup, we'll merge this one

@cthoyt
Copy link
Member

cthoyt commented Aug 31, 2021

@sbonner0 before we merge you have to assign two emoji to the front of this PR's name

@sbonner0 sbonner0 changed the title from "Added the biokg dataset from Walsh et al 2020" to "🧬 💿 Added the biokg dataset from Walsh et al 2020" on Sep 1, 2021
@sbonner0
Contributor Author

sbonner0 commented Sep 1, 2021

Thanks for your help with this @cthoyt !

@cthoyt cthoyt merged commit 3976d36 into pykeen:master Sep 1, 2021
@sbonner0 sbonner0 deleted the feature/biokg branch September 1, 2021 09:12
@Rodrigo-A-Pereira
Contributor

There seems to be a slight mistake in the number of triples for this dataset in the documentation. The number of triples should be 2067998 instead of 105524 (which is the number of entities).

@cthoyt
Copy link
Member

cthoyt commented Sep 8, 2021

@Rodrigo-A-Pereira thanks for pointing that out. The summary should be as follows:

$ python -m pykeen.datasets.biokg -vv
2021-09-08 12:32:23 INFO     done splitting triples to groups of sizes [1552051, 206800, 206800]
2021-09-08 12:32:23 INFO     [BioKG] done splitting data from /Users/cthoyt/.data/pykeen/datasets/biokg/biokg.zip
BioKG (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    105524      17             1654397
Testing     105524      17              206800
Validation  105524      17              206800
Total       -           -              2067997
Head        Relation                     tail
----------  ---------------------------  -------------
A0A075B6P5  PROTEIN_PATHWAY_ASSOCIATION  R-HSA-983695
A0A075B6S6  PROTEIN_PATHWAY_ASSOCIATION  R-HSA-983695
A0A078BQP2  PROTEIN_PATHWAY_ASSOCIATION  R-CEL-2514859
A0A087WPF7  PROTEIN_PATHWAY_ASSOCIATION  R-MMU-8939243
A0A087X1C5  PROTEIN_PATHWAY_ASSOCIATION  hsa04726
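
As a quick sanity check on the summary table, the three split sizes can be added up and compared with the reported total. The numbers below are copied from the table above; the variable names are only illustrative.

```python
# Split sizes copied from the "Triples" column of the summary table.
split_sizes = {
    "Training": 1654397,
    "Testing": 206800,
    "Validation": 206800,
}

total = sum(split_sizes.values())
print(total)  # 2067997, matching the "Total" row

# Share of triples in each split, rounded for readability.
fractions = {name: round(size / total, 4) for name, size in split_sizes.items()}
print(fractions)
```

The sum agrees with the "Total" row and is one less than the 2067998 quoted in the earlier comment; the thread does not say where that difference comes from.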
