Prevent users from uploading and give warnings during upload if certain basic metadata and Wikitext criteria aren't met #95

Open
trnstlntk opened this issue Sep 3, 2023 · 18 comments · May be fixed by OpenRefine/OpenRefine#7068

@trnstlntk
Contributor

I'm checking some recent uploads with OpenRefine to Wikimedia Commons and some users keep these uploads extremely minimal.

This upload, for instance, doesn't include any structured data or Wikitext other than a license. This is not sufficient according to Wikimedia Commons guidelines; the file should at least have a set of minimal structured data statements and/or a minimal infobox template in Wikitext.

This is clearly explained in the current how-to - but not every user is aware of, or reads, these guidelines.

To prevent users from doing such (too minimal) uploads in the future, OpenRefine should provide warnings and probably even prevent uploads if minimal conditions aren't met.

Checks can be done for

  • Files not having at least the minimally required structured data statements
  • Files without a == {{int:filedesc}} == header and/or without one of the following bits of infobox template wikitext:
    {{Information}} – default template meant especially for photographs created by users
    {{Artwork}} – for paintings, artworks and artifacts held by museums and other GLAM institutions
    {{Photograph}} – for photographs held by museums and other GLAM institutions
    {{Art photo}} – for photographs of artworks with more fields for photographer metadata
    {{Book}} – for books
    {{Map}} – for maps
    {{Musical work}} – for music files
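
The wikitext check listed above could be approximated with a simple pattern match. The following Python sketch is only an illustration, not OpenRefine code: the function name `check_wikitext` and the warning strings are made up for this example, and the matching rules (first letter case-insensitive, spaces and underscores interchangeable) are a simplification of how MediaWiki resolves template names. It cannot detect a header or infobox that is added indirectly through another template, and it does not check whether the infobox is empty.

```python
import re

# Template names copied from the list above.
INFOBOX_TEMPLATES = [
    "Information", "Artwork", "Photograph", "Art photo",
    "Book", "Map", "Musical work",
]

# Matches "== {{int:filedesc}} ==" with optional whitespace.
HEADER_RE = re.compile(r"==\s*\{\{\s*int:filedesc\s*\}\}\s*==", re.IGNORECASE)


def _template_re(name):
    """Build a regex for "{{Name" followed by "|" or "}}" (heuristic)."""
    words = name.split(" ")
    first = words[0]
    # First letter may be upper or lower case; the rest must match exactly.
    head = "[%s%s]%s" % (first[0].upper(), first[0].lower(), re.escape(first[1:]))
    # Spaces and underscores are treated as interchangeable between words.
    pattern = "[ _]+".join([head] + [re.escape(w) for w in words[1:]])
    return re.compile(r"\{\{\s*" + pattern + r"\s*(?:\||\}\})")


def check_wikitext(wikitext):
    """Return a list of warning strings; an empty list means no warning."""
    warnings = []
    if not HEADER_RE.search(wikitext):
        warnings.append("missing == {{int:filedesc}} == header")
    if not any(_template_re(n).search(wikitext) for n in INFOBOX_TEMPLATES):
        warnings.append("missing infobox template")
    return warnings
```

A wikitext that contains only a license template would produce both warnings, while one with the header and an `{{Information}}` block would produce none.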
@trnstlntk trnstlntk moved this from To be triaged to 2023-24 grant - candidate updates in Structured Data on Commons Sep 3, 2023
@wetneb
Member

wetneb commented Sep 3, 2023

We could have this in the main repository, because those checks should be enforced regardless of whether the commons extension is installed, no?

@trnstlntk
Contributor Author

trnstlntk commented Sep 3, 2023 via email

@wetneb
Member

wetneb commented Sep 3, 2023

Yes, I guess those additional checks could be added by the Commons extension instead of the Wikibase one; it's debatable. I'm not sure what proportion of people doing Commons uploads actually have the Commons extension installed; it would be interesting to have some stats about that.

@trnstlntk
Contributor Author

trnstlntk commented Sep 3, 2023 via email

@sunilnatraj
Contributor

@wetneb Can you assign this to me?

Screenshot 2024-11-12 at 6 58 27 PM

@wetneb
Member

wetneb commented Nov 12, 2024

@sunilnatraj thanks for your initiative! The screenshot you have looks good. I expect there will be quite some design work on this issue (what severity for the warnings, how to specify this list of required properties, which text to use…), which would be worth doing in tandem with @Vesihiisi and @sebastian-berlin-wmse, so expect more rounds of back and forth exchanges on this one.

@sunilnatraj
Contributor

@Vesihiisi @sebastian-berlin-wmse Do share your inputs on this.

@sunilnatraj
Contributor

sunilnatraj commented Nov 13, 2024

@wetneb @sebastian-berlin-wmse @Vesihiisi

The constraints for new media can be defined in the manifest file (see the snippet below). If the constraints are defined, the validations are carried out for each new media entity.

"mediawiki": {
    "name": "Wikidata",
    "root": "https://www.wikidata.org/wiki/",
    "main_page": "https://www.wikidata.org/wiki/Wikidata:Main_Page",
    "api": "https://www.wikidata.org/w/api.php",
    "constraints": {
        "required_properties": "P7482, P571, P170, P6216, P275",
        "wikitext_requires_anyone_infobox_template" : "Information, Artwork, Photography, Art photo, Book, Map, Musical work"
    }
}

@wetneb
Member

wetneb commented Nov 13, 2024

Yes, it feels fitting to define the required property ids in the manifest.
Instead of encoding lists as strings, I would use JSON's native ability to represent arrays: ["P7482", "P571", "P170", "P6216", "P275"] instead of "P7482, P571, P170, P6216, P275".
There is then the question of whether that's expressive enough:

  • should all those properties be required with the same severity, or would it be useful to specify the severity level independently? For instance with [{"pid": "P7482", "severity": "critical"}, {"pid": "P571", "severity": "warn"}]
  • is this format expressive enough to represent the requirements on Commons side? For instance, are there any situations where one of two properties should be provided, but not necessarily both?
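
To make the three notations discussed so far easier to compare, here is a small Python sketch that accepts any of them (comma-separated string, plain array, or array of objects with a severity) and reduces them to one internal shape. This is not taken from OpenRefine: the function names `normalize_required_properties` and `missing_properties` are hypothetical, and defaulting to the "critical" severity when none is given is an assumption made for the example.

```python
def normalize_required_properties(raw, default_severity="critical"):
    """Normalize a manifest value into [{'pid': ..., 'severity': ...}, ...].

    Accepted inputs (all three appear in this discussion):
      "P7482, P571"                              comma-separated string
      ["P7482", "P571"]                          JSON array of ids
      [{"pid": "P7482", "severity": "critical"}] array with severities
    """
    if isinstance(raw, str):
        items = [part.strip() for part in raw.split(",") if part.strip()]
    else:
        items = list(raw)

    normalized = []
    for item in items:
        if isinstance(item, str):
            normalized.append({"pid": item, "severity": default_severity})
        else:
            normalized.append({
                "pid": item["pid"],
                "severity": item.get("severity", default_severity),
            })
    return normalized


def missing_properties(requirements, present_pids):
    """Return (pid, severity) pairs for requirements not in present_pids."""
    return [
        (req["pid"], req["severity"])
        for req in requirements
        if req["pid"] not in present_pids
    ]
```

With this shape, a per-property severity costs nothing extra for manifests that only list ids, but it still cannot express "one of two properties", which is the second open question above.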

Concerning the wikitext, I wonder if we can find a reliable and current source of information about the requirements. I am not sure to what extent we can really validate this field, because some of the required parts, such as == {{int:filedesc}} ==, could potentially be added via a template, no? If we do add validation on this part, I would use a rather low severity level, because it's likely that whatever heuristic we use will not work perfectly.

@sunilnatraj
Contributor

sunilnatraj commented Nov 13, 2024

@wetneb I reviewed the links in the issue. As per the definition there, a media file only has mandatory properties, which is why I suggested making a list of required/mandatory properties. If we also need to validate optional properties, then your suggestion makes sense. The other case you mentioned requires P112 OR P114; if this has to be supported, is this definition available from the Wikimedia system?

Wikitext validation: I referred to this link. As per this, there are infobox templates, and the proposal is to verify that at least one of them is present and not empty.

@sunilnatraj
Contributor

sunilnatraj commented Nov 13, 2024

@wetneb a more expansive definition model

"mediawiki": {
    "name": "Wikidata",
    "root": "https://www.wikidata.org/wiki/",
    "main_page": "https://www.wikidata.org/wiki/Wikidata:Main_Page",
    "api": "https://www.wikidata.org/w/api.php",
    "constraints": {
        "validations": {
            "requiredProperties": {
                "pids": ["P170", "P452", "P453"],
                "severity": "critical"
            },
            "conditionalProperties": [
                {
                    "pids": ["P121", "P144"],
                    "severity": "critical"
                }
            ],
            "optionalProperties": {
                "pids": ["P542", "P543", "P544"],
                "severity": "warn"
            }
        },
        "wikitext_requires_anyone_infobox_template": [
            "Information", "Artwork", "Photography", "Art photo", "Book", "Map", "Musical work"
        ]
    }
}
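
As a rough illustration of how a "validations" block like this one might be evaluated, here is a Python sketch. It is not the implementation in OpenRefine: `validate_statements` and the issue dictionaries are hypothetical, and reading each "conditionalProperties" group as "at least one of these properties must be present" is an assumption based on the "P112 OR P114" case mentioned earlier in the thread.

```python
# The "validations" object from the proposal above, as a Python dict.
VALIDATIONS = {
    "requiredProperties": {"pids": ["P170", "P452", "P453"], "severity": "critical"},
    "conditionalProperties": [
        {"pids": ["P121", "P144"], "severity": "critical"},
    ],
    "optionalProperties": {"pids": ["P542", "P543", "P544"], "severity": "warn"},
}


def validate_statements(validations, present_pids):
    """Return a list of {'severity', 'message'} dicts for one new media entity."""
    present = set(present_pids)
    issues = []

    # Every required property must be present.
    required = validations.get("requiredProperties", {})
    for pid in required.get("pids", []):
        if pid not in present:
            issues.append({
                "severity": required.get("severity", "critical"),
                "message": "missing required property " + pid,
            })

    # Assumption: each conditional group means "at least one of these".
    for group in validations.get("conditionalProperties", []):
        pids = group.get("pids", [])
        if pids and not present.intersection(pids):
            issues.append({
                "severity": group.get("severity", "critical"),
                "message": "need at least one of " + ", ".join(pids),
            })

    # Optional properties only ever produce their configured (lower) severity.
    optional = validations.get("optionalProperties", {})
    for pid in optional.get("pids", []):
        if pid not in present:
            issues.append({
                "severity": optional.get("severity", "warn"),
                "message": "missing optional property " + pid,
            })

    return issues
```

Calling `validate_statements(VALIDATIONS, {"P170"})` would report the two other required properties, the unmet conditional group, and all three optional properties.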

@sunilnatraj
Contributor

@wetneb @sebastian-berlin-wmse @Vesihiisi Any inputs on the proposed solution?

@sunilnatraj
Contributor

@wetneb @sebastian-berlin-wmse @Vesihiisi Following up: any inputs on the proposed solution?

@wetneb
Member

wetneb commented Dec 10, 2024

Sorry that it's taking so long to get feedback on this. The problem in this case is that it's quite some work to do the associated research to make sure the solution is fit for purpose, and no one seems to have the time to work on this at the moment. But maybe you can?

You could go to Wikimedia Commons and ask the community for feedback on this. I would try to phrase it in terms that are understandable without deep knowledge of OpenRefine, such as:

As you may know, OpenRefine lets users upload media files to Commons in batch. Because some of the uploads done in this way add too little metadata to the uploaded files, we are considering introducing more pre-upload checks to prevent that. We need your help to determine which metadata fields should be required for any file uploaded via OpenRefine. Are these guidelines still up to date and accurate?
Based on this information, we would require the users to provide:

  • A caption
  • An "inception (P571)" statement
  • A "source of file (P7482)" statement
  • A "creator (P170)" statement
  • A "copyright status (P6216)" statement

We would not require "copyright license (P275)" as this statement is not required for works in the public domain, and we don't anticipate being able to express this conditional dependency.
We also looked into adding constraints on the wikitext associated with the media files, but this is likely too complicated to implement reliably, as some required parts could be added via different sorts of templates, which OpenRefine isn't able to expand before upload.

What do you think of this plan? Can you think of any case where it would be fine to upload a file without one of the 5 fields mentioned above? Do you think OpenRefine should only warn the user about those missing fields, or even prevent the upload entirely if those fields are not provided?

You could try to write a message like this one (feel free to copy all/parts of it) and send it to the Wikimedia Commons community. Good places for it could be:

It probably makes sense to post it in multiple places to maximize the chances of getting informed feedback.

@sunilnatraj
Contributor

sunilnatraj commented Jan 6, 2025

Feedback from Village pump - https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2024/12#OpenRefine_-_Commons_upload_validations

  • inception and caption should not be required.
  • Simply giving suggestions and examples during the setup, might already help to combat that particular issue.

@sunilnatraj
Contributor

sunilnatraj commented Jan 6, 2025

Optional

  1. A caption
  2. An inception (P571) statement

Mandatory

  1. A source of file (P7482) statement
  2. A creator (P170) statement
  3. A copyright status (P6216) statement

@wetneb
Member

wetneb commented Jan 6, 2025

Great! Maybe we could:

  • add a check for new mediainfo entities without captions, generating a warning message (not critical) - NewEntityScrutinizer is probably a fitting place for this. This should not need any changes to the manifest
  • introduce a list of default media info properties in the manifest (maybe something like "defaultMediaInfoPropertyIds": ["P7482", "P170", "P6216"]), that could be added directly in the schema when people create a new mediainfo section in it (maybe with only the mandatory ones). We could also consider checking for their presence in a scrutinizer.
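
The two ideas above can be sketched in a few lines of Python. This only illustrates the logic and is not the Java `NewEntityScrutinizer`: the dict layout for a mediainfo entity, the function names, and the message texts are all hypothetical, and the severity used for missing default properties is left as a parameter because it has not been decided in this thread.

```python
# Property ids from the manifest key suggested above.
DEFAULT_MEDIAINFO_PROPERTY_IDS = ["P7482", "P170", "P6216"]


def new_mediainfo_section(default_pids=DEFAULT_MEDIAINFO_PROPERTY_IDS):
    """Create an empty schema section pre-filled with the default properties."""
    return {"captions": {}, "statements": {pid: [] for pid in default_pids}}


def check_new_mediainfo(entity,
                        default_pids=DEFAULT_MEDIAINFO_PROPERTY_IDS,
                        property_severity="warning"):
    """Return (severity, message) pairs for a new mediainfo entity.

    `entity` is a hypothetical dict with "captions" (language -> text)
    and "statements" (property id -> list of values).
    """
    issues = []

    # A missing caption is only a warning, never critical.
    captions = entity.get("captions") or {}
    if not any(text.strip() for text in captions.values()):
        issues.append(("warning", "new media file has no caption"))

    # Default properties that were left without any value.
    statements = entity.get("statements") or {}
    for pid in default_pids:
        if not statements.get(pid):
            issues.append((property_severity, "no statement for " + pid))

    return issues
```

Pre-filling the section gives users the suggestion during setup, which matches the Village pump feedback, while the check catches the cases where the suggested rows were left empty.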

Projects
Status: 2023-24 grant - candidates for (bug) fixes