Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

publications in asset metadata #1308

Closed
bendichter opened this issue Jul 3, 2023 · 24 comments · Fixed by #1417
Closed

publications in asset metadata #1308

bendichter opened this issue Jul 3, 2023 · 24 comments · Fixed by #1417
Assignees

Comments

@bendichter
Copy link
Member

bendichter commented Jul 3, 2023

In dandi/helpdesk#100, @GaelleChapuis wanted to know how to query for NWB files within the big IBL dandiset that are associated with a given publication. This dandiset contains data that are relevant to several different publications, and each NWB file lists which publications it is associated with under the field nwbfile.related_publications. It looks like the dandi cli currently extracts these values and adds them to the dandiset-level metadata, but not the asset-level metadata. As a result, our best approach to query for these files currently is it iterate over all of the NWB files in the dandiset, open them in stream mode, and read this value. This is very slow compared to iterating over the asset-level metadata.

The solution I would like is to use these extracted values to also populate the asset-level metadata so they can be differentiated by this attribute. Since RelatedResource is part of the CommonModel that Asset inherits, it looks like the schema can support this already. It's just a matter of having the dandi cli extract this information. Note that NWB files do not always contain reference to publications, and they should not be required to, since sometimes data is not associated with any publication, so this extraction logic should be robust to the case where nwbfile.related_publication is None.

@satra
Copy link
Member

satra commented Jul 3, 2023

i would suggest creating a function that handles mapping NWB metadata to dandi schema metadata. currently this is all done in the CLI, and reflects adaptations to NWB validation (e.g. where we wanted URIs and details like cell lines). however, the ideal thing would be a mapping function. this could exist in nwbinspector which already handles dandi specific components instead of the CLI, but i leave this to the CLI folks.

@bendichter
Copy link
Member Author

That could work. We could create a function for mapping asset-level NWB metadata to the dandi schema for assets if the dandi cli team would be interested in incorporating this function

@yarikoptic
Copy link
Member

To say I am not quite following why it should reside in nwbinspector which job is to inspect, not to map etc, although I do see that "inspection" could be carried out on a "mapped/converted" instance of the metadata I guess. But I don't think that is what is happening anyways ATM.

Here it seems indeed just a matter of adding more extracted metadata, which is what we already do in dandi-cli so I think it should just be done in it.

@satra
Copy link
Member

satra commented Jul 5, 2023

@yarikoptic - only suggested it because there was always the lingering notion of removing some of those dictionaries and mappings out the dandi library, but i leave it as your call.

@jwodder
Copy link
Member

jwodder commented Nov 20, 2023

@bendichter When you say the field in question is nwbfile.related_publications, do you mean:

  • The field is keyed by "related_publications" inside an NWB file
  • The field is keyed by "nwbfile.related_publications" inside an NWB file
  • The field is keyed by "related_publications" inside a mapping keyed by "nwbfile" inside an NWB file

Also, could you link me to an example NWB that contains this field?

@jwodder
Copy link
Member

jwodder commented Nov 20, 2023

@bendichter

It looks like the dandi cli currently extracts these values and adds them to the dandiset-level metadata

I don't think that's true. The only times when dandi-cli sets the metadata for a Dandiset in the Archive are (a) when running dandi upload — but that's only if the developer option --upload-dandiset-metadata is given, and then only the metadata in the dandiset.yaml file is sent to the Archive — and (b) when running dandi service-scripts update-dandiset-from-doi. Are you sure it's not the Archive that's removing the values from asset metadata and adding them to Dandiset metadata?

@jwodder
Copy link
Member

jwodder commented Nov 29, 2023

@bendichter Ping.

@jwodder
Copy link
Member

jwodder commented Dec 6, 2023

@bendichter Ping.

@bendichter
Copy link
Member Author

@jwodder sorry for the delay

Here is an example file:

By nwbfile.related_publications I mean that for a local NWB file, I would access this field using PyNWB via:

from pynwb import NWBHDF5IO

with NWBHDF5IO("fpath", "r", load_namespaces=True) as io:
    nwbfile = io.read()
    related_publications = nwbfile.related_publications

I was under the impression that this value was extracted by the dandi cli because of this line:

"related_publications",

and a few other places where related_publications shows up in the dandi cli code.

@jwodder
Copy link
Member

jwodder commented Dec 6, 2023

@bendichter

  • Could you provide something smaller than 86 GiB? I'm not going to download something that huge, and extracting metadata over the network takes five to ten minutes.
  • I can confirm that, though the client extracts the related_publications field from NWBs, the field is discarded when converting NWB metadata to dandischema metadata. The client is definitely not including this information in Dandiset metadata.
  • I'm assuming that the related_publications field in an NWB can be either a sequence of strings or (based on some of our test cases) a single string. Is this correct?
  • How exactly should related_publications fields be converted to dandischema Resources? For the provided NWB in particular, the field is a list containing a single string containing two DOI URLs separated by a comma & space, and one of these DOI URLs was used as the identifier field in a Resource in the Dandiset metadata. (CC @satra)

@bendichter
Copy link
Member Author

@jwodder

Here's one that's < 1GB: https://api.dandiarchive.org/api/dandisets/000409/versions/draft/assets/64c796e4-3935-4d43-b4ca-9d1df17a07c2/

I'm assuming that the related_publications field in an NWB can be either a sequence of strings or (based on some of our test cases) a single string. Is this correct?

Yes, this field can be a list of strings. See the schema entry here. I think in this case neurosift is doing the JS equivalent of ", ".join(related_publications) for the text it is displaying.

How exactly should related_publications fields be converted to dandischema Resources?

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value and Resource.relation should be "cite:IsDescribedBy".

@jwodder
Copy link
Member

jwodder commented Dec 6, 2023

@bendichter

I think in this case neurosift is doing the JS equivalent of ", ".join(related_publications) for the text it is displaying.

No, our _get_pynwb_metadata() also returns a list of one comma-separated string.

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value

What about for cases like this where the value isn't a DOI URL (or whatever we require of an identifier)?

@bendichter
Copy link
Member Author

No, our _get_pynwb_metadata() also returns a list of one comma-separated string.

Ah, ok, well I would say that's how it's supposed to be added, but we don't do any kind of validation on the value in PyNWB. We can add a check for this in nwbinspector so that we can have more trust in future contributions. @CodyCBakerPhD , do you know why the current IBL data is a comma-separated string instead of a list of string?

@CodyCBakerPhD
Copy link
Contributor

We can add a check for this in nwbinspector so that we can have more trust in future contributions.

We've always had a check for DOI format for all entries in the related publication list: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/0e3640048f18838762af98ccbf071b07f0b39e48/src/nwbinspector/checks/nwbfile_metadata.py#L115

@CodyCBakerPhD , do you know why the current IBL data is a comma-separated string instead of a list of string?

Not sure what you're talking about - the metadata encodes it as a list in the YAML: https://github.com/catalystneuro/IBL-to-nwb/blob/5195b4eada1f3c82416cd3f26d523fd72752fadf/ibl_to_nwb/brainwide_map/brainwide_map_metadata.yml#L6-L8

And the HDF5 file contents show each string written as a separate element on that dataset

image

@bendichter
Copy link
Member Author

OK, yes, I thought this was the case. Thanks for the clarification, @CodyCBakerPhD

@jwodder , is it possible that _get_pynwb_metadata() is doing this join behind the scenes?

@jwodder
Copy link
Member

jwodder commented Dec 6, 2023

@CodyCBakerPhD You appear to be looking at the second NWB Ben provided, the related_publications of which is indeed a list of two strings. However, the first NWB Ben provided stores its related_publications as a list containing a single string containing two comma-separated URLs.

@bendichter I see no join in _get_pynwb_metadata(). I tried to print out just the first NWB's related_publications field directly to confirm what it actually looks like, but the script ran for more than thirty minutes without finishing.

@CodyCBakerPhD
Copy link
Contributor

@jwodder Right you are - the file with the bulk contents has top level metadata from an earlier draft version that hasn't been updated

We can definetely add an NWB Inspector check for someone adding single string with comma-separated entires instead of multiple listed entries

@jwodder
Copy link
Member

jwodder commented Dec 19, 2023

@bendichter Repeating some questions from earlier:

  • Can related_publications legitimately be a single string as an alternative to a list of strings? I can't make any sense of the schema entry.
    • If it is not allowed for a related_publications field to be a single string, what should happen if the client encounters one that is?

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value and Resource.relation should be "cite:IsDescribedBy".

I'm assuming this is intended for the case where each entry of related_publications is a DOI URL. What if an entry is a different kind of URL? What if it's not a valid URL at all (as in the case of the two URLs joined by a comma)?

@jwodder
Copy link
Member

jwodder commented Jan 2, 2024

@bendichter Ping.

@bendichter
Copy link
Member Author

bendichter commented Jan 4, 2024

@jwodder "related publications" is a list of strings, and each string should be a DOI. It usually only has one entry (and is often empty). If there are multiple related publications, the proper way of entering that would be as a list of DOI strings. We do not do any validation to ensure users are not entering related publications as "[first_pub], [second_pub]" or any other kind of custom concatenation, so I do suspect we might run into that every now and then. We can make an NWB Inspector check that minimizes this. Another approach could be to test the validity of each entry as a URL and only accept those that are proper URLs, or proper DOIs, depending on how strict we want to be on the DANDI side.

@jwodder
Copy link
Member

jwodder commented Jan 4, 2024

@bendichter You didn't answer the crucial question, which is: What should the code do when "related publications" isn't what you say it should be?

@bendichter
Copy link
Member Author

Skip it

@jwodder
Copy link
Member

jwodder commented Jan 4, 2024

The need to check for valid URLs makes this blocked on #1380.

@jwodder jwodder added blocked Blocked by some needed development/fix and removed awaiting-user-response labels Jan 4, 2024
@jwodder jwodder removed the blocked Blocked by some needed development/fix label Mar 7, 2024
yarikoptic added a commit that referenced this issue Mar 25, 2024
Convert DOI URLs in `related_publications` to related resources
Copy link

github-actions bot commented May 3, 2024

🚀 Issue was released in 0.62.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants