-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
publications in asset metadata #1308
Comments
i would suggest creating a function that handles mapping NWB metadata to dandi schema metadata. currently this is all done in the CLI, and reflects adaptations to NWB validation (e.g. where we wanted URIs and details like cell lines). however, the ideal thing would be a mapping function. this could exist in nwbinspector which already handles dandi specific components instead of the CLI, but i leave this to the CLI folks. |
That could work. We could create a function for mapping asset-level NWB metadata to the dandi schema for assets if the dandi cli team would be interested in incorporating this function |
To say I am not quite following why it should reside in nwbinspector which job is to inspect, not to map etc, although I do see that "inspection" could be carried out on a "mapped/converted" instance of the metadata I guess. But I don't think that is what is happening anyways ATM. Here it seems indeed just a matter of adding more extracted metadata, which is what we already do in dandi-cli so I think it should just be done in it. |
@yarikoptic - only suggested it because there was always the lingering notion of removing some of those dictionaries and mappings out the dandi library, but i leave it as your call. |
@bendichter When you say the field in question is
Also, could you link me to an example NWB that contains this field? |
I don't think that's true. The only times when dandi-cli sets the metadata for a Dandiset in the Archive are (a) when running |
@bendichter Ping. |
@bendichter Ping. |
@jwodder sorry for the delay Here is an example file:
By from pynwb import NWBHDF5IO
with NWBHDF5IO("fpath", "r", load_namespaces=True) as io:
nwbfile = io.read()
related_publications = nwbfile.related_publications I was under the impression that this value was extracted by the dandi cli because of this line: Line 621 in b9a1099
and a few other places where related_publications shows up in the dandi cli code. |
|
Here's one that's < 1GB: https://api.dandiarchive.org/api/dandisets/000409/versions/draft/assets/64c796e4-3935-4d43-b4ca-9d1df17a07c2/
Yes, this field can be a list of strings. See the schema entry here. I think in this case neurosift is doing the JS equivalent of
I think we should create one |
No, our
What about for cases like this where the value isn't a DOI URL (or whatever we require of an identifier)? |
Ah, ok, well I would say that's how it's supposed to be added, but we don't do any kind of validation on the value in PyNWB. We can add a check for this in nwbinspector so that we can have more trust in future contributions. @CodyCBakerPhD , do you know why the current IBL data is a comma-separated string instead of a list of string? |
We've always had a check for DOI format for all entries in the related publication list: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/0e3640048f18838762af98ccbf071b07f0b39e48/src/nwbinspector/checks/nwbfile_metadata.py#L115
Not sure what you're talking about - the metadata encodes it as a list in the YAML: https://github.com/catalystneuro/IBL-to-nwb/blob/5195b4eada1f3c82416cd3f26d523fd72752fadf/ibl_to_nwb/brainwide_map/brainwide_map_metadata.yml#L6-L8 And the HDF5 file contents show each string written as a separate element on that dataset |
OK, yes, I thought this was the case. Thanks for the clarification, @CodyCBakerPhD @jwodder , is it possible that |
@CodyCBakerPhD You appear to be looking at the second NWB Ben provided, the @bendichter I see no join in |
@jwodder Right you are - the file with the bulk contents has top level metadata from an earlier draft version that hasn't been updated We can definetely add an NWB Inspector check for someone adding single string with comma-separated entires instead of multiple listed entries |
@bendichter Repeating some questions from earlier:
I'm assuming this is intended for the case where each entry of |
@bendichter Ping. |
@jwodder "related publications" is a list of strings, and each string should be a DOI. It usually only has one entry (and is often empty). If there are multiple related publications, the proper way of entering that would be as a list of DOI strings. We do not do any validation to ensure users are not entering related publications as "[first_pub], [second_pub]" or any other kind of custom concatenation, so I do suspect we might run into that every now and then. We can make an NWB Inspector check that minimizes this. Another approach could be to test the validity of each entry as a URL and only accept those that are proper URLs, or proper DOIs, depending on how strict we want to be on the DANDI side. |
@bendichter You didn't answer the crucial question, which is: What should the code do when "related publications" isn't what you say it should be? |
Skip it |
The need to check for valid URLs makes this blocked on #1380. |
Convert DOI URLs in `related_publications` to related resources
🚀 Issue was released in |
In dandi/helpdesk#100, @GaelleChapuis wanted to know how to query for NWB files within the big IBL dandiset that are associated with a given publication. This dandiset contains data that are relevant to several different publications, and each NWB file lists which publications it is associated with under the field
nwbfile.related_publications
. It looks like the dandi cli currently extracts these values and adds them to the dandiset-level metadata, but not the asset-level metadata. As a result, our best approach to query for these files currently is it iterate over all of the NWB files in the dandiset, open them in stream mode, and read this value. This is very slow compared to iterating over the asset-level metadata.The solution I would like is to use these extracted values to also populate the asset-level metadata so they can be differentiated by this attribute. Since
RelatedResource
is part of theCommonModel
thatAsset
inherits, it looks like the schema can support this already. It's just a matter of having the dandi cli extract this information. Note that NWB files do not always contain reference to publications, and they should not be required to, since sometimes data is not associated with any publication, so this extraction logic should be robust to the case wherenwbfile.related_publication is None
.The text was updated successfully, but these errors were encountered: