publications in asset metadata #1308

bendichter · 2023-07-03T15:18:44Z

In dandi/helpdesk#100, @GaelleChapuis wanted to know how to query for NWB files within the big IBL dandiset that are associated with a given publication. This dandiset contains data that are relevant to several different publications, and each NWB file lists which publications it is associated with under the field nwbfile.related_publications. It looks like the dandi cli currently extracts these values and adds them to the dandiset-level metadata, but not the asset-level metadata. As a result, our best approach to query for these files currently is it iterate over all of the NWB files in the dandiset, open them in stream mode, and read this value. This is very slow compared to iterating over the asset-level metadata.

The solution I would like is to use these extracted values to also populate the asset-level metadata so they can be differentiated by this attribute. Since RelatedResource is part of the CommonModel that Asset inherits, it looks like the schema can support this already. It's just a matter of having the dandi cli extract this information. Note that NWB files do not always contain reference to publications, and they should not be required to, since sometimes data is not associated with any publication, so this extraction logic should be robust to the case where nwbfile.related_publication is None.

The text was updated successfully, but these errors were encountered:

satra · 2023-07-03T15:30:57Z

i would suggest creating a function that handles mapping NWB metadata to dandi schema metadata. currently this is all done in the CLI, and reflects adaptations to NWB validation (e.g. where we wanted URIs and details like cell lines). however, the ideal thing would be a mapping function. this could exist in nwbinspector which already handles dandi specific components instead of the CLI, but i leave this to the CLI folks.

bendichter · 2023-07-03T15:50:16Z

That could work. We could create a function for mapping asset-level NWB metadata to the dandi schema for assets if the dandi cli team would be interested in incorporating this function

yarikoptic · 2023-07-05T17:58:38Z

To say I am not quite following why it should reside in nwbinspector which job is to inspect, not to map etc, although I do see that "inspection" could be carried out on a "mapped/converted" instance of the metadata I guess. But I don't think that is what is happening anyways ATM.

Here it seems indeed just a matter of adding more extracted metadata, which is what we already do in dandi-cli so I think it should just be done in it.

satra · 2023-07-05T20:59:50Z

@yarikoptic - only suggested it because there was always the lingering notion of removing some of those dictionaries and mappings out the dandi library, but i leave it as your call.

jwodder · 2023-11-20T19:43:01Z

@bendichter When you say the field in question is nwbfile.related_publications, do you mean:

The field is keyed by "related_publications" inside an NWB file
The field is keyed by "nwbfile.related_publications" inside an NWB file
The field is keyed by "related_publications" inside a mapping keyed by "nwbfile" inside an NWB file

Also, could you link me to an example NWB that contains this field?

jwodder · 2023-11-20T19:49:41Z

@bendichter

It looks like the dandi cli currently extracts these values and adds them to the dandiset-level metadata

I don't think that's true. The only times when dandi-cli sets the metadata for a Dandiset in the Archive are (a) when running dandi upload — but that's only if the developer option --upload-dandiset-metadata is given, and then only the metadata in the dandiset.yaml file is sent to the Archive — and (b) when running dandi service-scripts update-dandiset-from-doi. Are you sure it's not the Archive that's removing the values from asset metadata and adding them to Dandiset metadata?

jwodder · 2023-11-29T13:42:57Z

@bendichter Ping.

jwodder · 2023-12-06T13:53:27Z

@bendichter Ping.

bendichter · 2023-12-06T14:31:21Z

@jwodder sorry for the delay

Here is an example file:

details: https://api.dandiarchive.org/api/dandisets/000409/versions/draft/assets/10762cd8-94b3-4691-8bd0-d690f7890bc6/
neurosift page where you can see the related publications field for this asset: https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/48a/312/48a31208-724e-4ee0-89b8-e8fa4fcf12ca&dandisetId=000409&dandisetVersion=draft

By nwbfile.related_publications I mean that for a local NWB file, I would access this field using PyNWB via:

from pynwb import NWBHDF5IO

with NWBHDF5IO("fpath", "r", load_namespaces=True) as io:
    nwbfile = io.read()
    related_publications = nwbfile.related_publications

I was under the impression that this value was extracted by the dandi cli because of this line:

dandi-cli/dandi/organize.py

Line 621 in b9a1099

"related_publications",

and a few other places where related_publications shows up in the dandi cli code.

jwodder · 2023-12-06T15:19:59Z

@bendichter

Could you provide something smaller than 86 GiB? I'm not going to download something that huge, and extracting metadata over the network takes five to ten minutes.
I can confirm that, though the client extracts the related_publications field from NWBs, the field is discarded when converting NWB metadata to dandischema metadata. The client is definitely not including this information in Dandiset metadata.
I'm assuming that the related_publications field in an NWB can be either a sequence of strings or (based on some of our test cases) a single string. Is this correct?
How exactly should related_publications fields be converted to dandischema Resources? For the provided NWB in particular, the field is a list containing a single string containing two DOI URLs separated by a comma & space, and one of these DOI URLs was used as the identifier field in a Resource in the Dandiset metadata. (CC @satra)

bendichter · 2023-12-06T15:36:45Z

@jwodder

Here's one that's < 1GB: https://api.dandiarchive.org/api/dandisets/000409/versions/draft/assets/64c796e4-3935-4d43-b4ca-9d1df17a07c2/

I'm assuming that the related_publications field in an NWB can be either a sequence of strings or (based on some of our test cases) a single string. Is this correct?

Yes, this field can be a list of strings. See the schema entry here. I think in this case neurosift is doing the JS equivalent of ", ".join(related_publications) for the text it is displaying.

How exactly should related_publications fields be converted to dandischema Resources?

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value and Resource.relation should be "cite:IsDescribedBy".

jwodder · 2023-12-06T15:42:49Z

@bendichter

I think in this case neurosift is doing the JS equivalent of ", ".join(related_publications) for the text it is displaying.

No, our _get_pynwb_metadata() also returns a list of one comma-separated string.

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value

What about for cases like this where the value isn't a DOI URL (or whatever we require of an identifier)?

bendichter · 2023-12-06T16:38:29Z

No, our _get_pynwb_metadata() also returns a list of one comma-separated string.

Ah, ok, well I would say that's how it's supposed to be added, but we don't do any kind of validation on the value in PyNWB. We can add a check for this in nwbinspector so that we can have more trust in future contributions. @CodyCBakerPhD , do you know why the current IBL data is a comma-separated string instead of a list of string?

CodyCBakerPhD · 2023-12-06T17:25:28Z

We can add a check for this in nwbinspector so that we can have more trust in future contributions.

We've always had a check for DOI format for all entries in the related publication list: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/0e3640048f18838762af98ccbf071b07f0b39e48/src/nwbinspector/checks/nwbfile_metadata.py#L115

@CodyCBakerPhD , do you know why the current IBL data is a comma-separated string instead of a list of string?

Not sure what you're talking about - the metadata encodes it as a list in the YAML: https://github.com/catalystneuro/IBL-to-nwb/blob/5195b4eada1f3c82416cd3f26d523fd72752fadf/ibl_to_nwb/brainwide_map/brainwide_map_metadata.yml#L6-L8

And the HDF5 file contents show each string written as a separate element on that dataset

bendichter · 2023-12-06T17:27:42Z

OK, yes, I thought this was the case. Thanks for the clarification, @CodyCBakerPhD

@jwodder , is it possible that _get_pynwb_metadata() is doing this join behind the scenes?

jwodder · 2023-12-06T18:31:15Z

@CodyCBakerPhD You appear to be looking at the second NWB Ben provided, the related_publications of which is indeed a list of two strings. However, the first NWB Ben provided stores its related_publications as a list containing a single string containing two comma-separated URLs.

@bendichter I see no join in _get_pynwb_metadata(). I tried to print out just the first NWB's related_publications field directly to confirm what it actually looks like, but the script ran for more than thirty minutes without finishing.

CodyCBakerPhD · 2023-12-06T20:51:55Z

@jwodder Right you are - the file with the bulk contents has top level metadata from an earlier draft version that hasn't been updated

We can definetely add an NWB Inspector check for someone adding single string with comma-separated entires instead of multiple listed entries

jwodder · 2023-12-19T15:42:02Z

@bendichter Repeating some questions from earlier:

Can related_publications legitimately be a single string as an alternative to a list of strings? I can't make any sense of the schema entry.
- If it is not allowed for a related_publications field to be a single string, what should happen if the client encounters one that is?

I think we should create one Resource object for each entry of related_publications where Resource.identifier is the value and Resource.relation should be "cite:IsDescribedBy".

I'm assuming this is intended for the case where each entry of related_publications is a DOI URL. What if an entry is a different kind of URL? What if it's not a valid URL at all (as in the case of the two URLs joined by a comma)?

jwodder · 2024-01-02T13:31:17Z

@bendichter Ping.

bendichter · 2024-01-04T16:12:49Z

@jwodder "related publications" is a list of strings, and each string should be a DOI. It usually only has one entry (and is often empty). If there are multiple related publications, the proper way of entering that would be as a list of DOI strings. We do not do any validation to ensure users are not entering related publications as "[first_pub], [second_pub]" or any other kind of custom concatenation, so I do suspect we might run into that every now and then. We can make an NWB Inspector check that minimizes this. Another approach could be to test the validity of each entry as a URL and only accept those that are proper URLs, or proper DOIs, depending on how strict we want to be on the DANDI side.

jwodder · 2024-01-04T16:24:30Z

@bendichter You didn't answer the crucial question, which is: What should the code do when "related publications" isn't what you say it should be?

bendichter · 2024-01-04T16:28:10Z

Skip it

jwodder · 2024-01-04T18:56:28Z

The need to check for valid URLs makes this blocked on #1380.

Convert DOI URLs in `related_publications` to related resources

github-actions · 2024-05-03T13:57:44Z

🚀 Issue was released in 0.62.0 🚀

yarikoptic added the easy label Jul 5, 2023

yarikoptic assigned jwodder Jul 5, 2023

yarikoptic-gitmate assigned yarikoptic-gitmate and unassigned jwodder Aug 10, 2023

jwodder assigned jwodder and unassigned yarikoptic-gitmate Nov 20, 2023

jwodder added the awaiting-user-response label Nov 30, 2023

CodyCBakerPhD mentioned this issue Dec 6, 2023

[Add Check]: Comma-separated entries in releated_publication should be list of entries NeurodataWithoutBorders/nwbinspector#419

Open

2 tasks

jwodder added blocked Blocked by some needed development/fix and removed awaiting-user-response labels Jan 4, 2024

jwodder mentioned this issue Jan 4, 2024

Use a dedicated URL-handling library #1380

Closed

5 tasks

jwodder removed the blocked Blocked by some needed development/fix label Mar 7, 2024

jwodder mentioned this issue Mar 7, 2024

Convert DOI URLs in related_publications to related resources #1417

Merged

yarikoptic closed this as completed in #1417 Mar 25, 2024

yarikoptic added a commit that referenced this issue Mar 25, 2024

Merge pull request #1417 from dandi/gh-1308

d5e2a76

Convert DOI URLs in `related_publications` to related resources

github-actions bot added the released label May 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

publications in asset metadata #1308

publications in asset metadata #1308

bendichter commented Jul 3, 2023 •

edited

Loading

satra commented Jul 3, 2023

bendichter commented Jul 3, 2023

yarikoptic commented Jul 5, 2023

satra commented Jul 5, 2023

jwodder commented Nov 20, 2023

jwodder commented Nov 20, 2023

jwodder commented Nov 29, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

CodyCBakerPhD commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

CodyCBakerPhD commented Dec 6, 2023

jwodder commented Dec 19, 2023

jwodder commented Jan 2, 2024

bendichter commented Jan 4, 2024 •

edited

Loading

jwodder commented Jan 4, 2024

bendichter commented Jan 4, 2024

jwodder commented Jan 4, 2024

github-actions bot commented May 3, 2024

publications in asset metadata #1308

publications in asset metadata #1308

Comments

bendichter commented Jul 3, 2023 • edited Loading

satra commented Jul 3, 2023

bendichter commented Jul 3, 2023

yarikoptic commented Jul 5, 2023

satra commented Jul 5, 2023

jwodder commented Nov 20, 2023

jwodder commented Nov 20, 2023

jwodder commented Nov 29, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

bendichter commented Dec 6, 2023

CodyCBakerPhD commented Dec 6, 2023

bendichter commented Dec 6, 2023

jwodder commented Dec 6, 2023

CodyCBakerPhD commented Dec 6, 2023

jwodder commented Dec 19, 2023

jwodder commented Jan 2, 2024

bendichter commented Jan 4, 2024 • edited Loading

jwodder commented Jan 4, 2024

bendichter commented Jan 4, 2024

jwodder commented Jan 4, 2024

github-actions bot commented May 3, 2024

bendichter commented Jul 3, 2023 •

edited

Loading

bendichter commented Jan 4, 2024 •

edited

Loading