Error 408 when consolidating citations #54

Open
LukasWallrich opened this issue Oct 4, 2022 · 7 comments

@LukasWallrich

The following file can be processed by grobid when called with curl, but the equivalent (?) Python command fails with error 408.

meta-chinese.pdf

curl call: curl -v --form input=@./meta-chinese.pdf --form consolidateCitations=1 localhost:8070/api/processReferences
python: client.process("processReferences", "./screened_PDF", consolidate_citations=True)

The server log does not show any obvious issues. The python command works when I don't consolidate citations.

Any ideas / suggestions?

@LukasWallrich
Author

LukasWallrich commented Oct 4, 2022

Just to add - a basic requests.post call works from Python. I can't quite see what the client is doing differently ...

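import requests

GROBID_URL = 'http://localhost:8070'
url = '%s/api/processReferences' % GROBID_URL
pdf = './screened_PDF/meta-chinese.pdf'
# Send the PDF as multipart form data with citation consolidation enabled;
# the context manager closes the file once the request has completed.
with open(pdf, 'rb') as f:
    xml = requests.post(url, files={'input': f}, data={'consolidateCitations': '1'})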

@lfoppiano
Collaborator

lfoppiano commented Oct 4, 2022

@LukasWallrich, the input_path should be a directory. Indeed, this is a bug, as the client should report an error when it is given something else. Single files can be processed by calling process_pdf. I'm not sure if process_pdf is meant to be called like that, though.
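
A check of the kind the client could perform on input_path might look roughly like the sketch below. This is a hypothetical helper written for illustration, not code from the client: the name check_input_dir and the error messages are assumptions.

```python
import os


def check_input_dir(input_path):
    """Return input_path if it is an existing directory, else raise a clear error.

    Hypothetical helper for illustration; it is not part of the GROBID client.
    """
    if os.path.isfile(input_path):
        # A single file was passed where a directory of PDFs is expected.
        raise NotADirectoryError(
            "%s is a file; pass the directory that contains it" % input_path
        )
    if not os.path.isdir(input_path):
        raise FileNotFoundError("input directory not found: %s" % input_path)
    return input_path
```

Calling such a helper before starting the batch would turn a silent failure into an explicit message about the wrong kind of path.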

lfoppiano added the "bug" label on Oct 4, 2022
kermitt2 removed the "bug" label on Oct 5, 2022
@kermitt2
Owner

kermitt2 commented Oct 5, 2022

Hello!

The purpose of this client is to process a directory of files, that is, to run a batch process while managing concurrency efficiently. I tried to make this explicit in the readme and in the --help:

  --input INPUT         path to the directory containing PDF files or .txt
                        (for processCitationList only, one reference per line)
                        to process
  --output OUTPUT       path to the directory where to put the results
                        (optional)

If you want to process a single PDF file, you can use client.process_pdf(), but as Luca said, it's not written to be used like that outside a batch process: all the arguments must be provided.

@LukasWallrich
Author

Thank you both! The input here is a folder with two files - the other one works fine. So that does not seem to be the issue.

@kermitt2
Owner

kermitt2 commented Oct 5, 2022

If it's a 408 timeout, it might simply be that the Crossref API is too slow to consolidate the citations. But for 2 files, that would mean the Crossref API is very, very slow. You can improve the response time a bit by indicating your email in the Grobid config file (the "polite" usage):
https://grobid.readthedocs.io/en/latest/Consolidation/#crossref-rest-api

However, when it is not in good shape, the Crossref API sometimes takes several seconds to answer each request. With many references, the timeout (60 seconds) might be reached. Even with a Plus token, this can happen.
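
As a back-of-the-envelope check of this point, the sketch below assumes one Crossref lookup per reference, performed one after another (a simplification; the server may well issue lookups differently). The function name and the example numbers are illustrative. Only the 60-second timeout comes from the comment above.

```python
def consolidation_exceeds_timeout(n_references, seconds_per_request, timeout_s=60.0):
    """Estimate whether consolidating one PDF would exceed the timeout.

    Assumes one lookup per reference, executed sequentially (an assumption).
    Returns (estimated_seconds, exceeds_timeout).
    """
    estimated_s = n_references * seconds_per_request
    return estimated_s, estimated_s > timeout_s


# Illustrative numbers: 80 references at about 1 second per lookup need
# roughly 80 s, which is longer than a 60 s timeout.
print(consolidation_exceeds_timeout(80, 1.0))
# At about 0.1 s per lookup, the same PDF needs roughly 8 s.
print(consolidation_exceeds_timeout(80, 0.1))
```

Under these assumptions, a single PDF with a long reference list can hit the timeout on its own, regardless of how many files are in the folder.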

For production, it's not really possible to use the Crossref web API, which is why biblio-glutton was developed.

@LukasWallrich
Author

Thanks. Adding the email is a bit difficult as I am on an M2 Mac and can thus only run Grobid in the Docker container, which is hard to edit. Anyway, the request through the client fails even when there is only one PDF in the folder, while the manual Python request works. Also, the server log shows that Crossref requests go through every second or so, so there might be something more specific going on.

For my use case, I only need to process a couple of hundred PDFs, so I can go down the more manual route, but obviously, the client would be helpful ...
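
The more manual route could be sketched along the following lines, assuming the same processReferences endpoint and form fields as the requests.post call earlier in this thread. The function names, the output file naming, and the 300-second timeout are assumptions chosen for illustration, not anything prescribed by the client or by GROBID.

```python
import glob
import os


def post_pdf(pdf_path, url, timeout_s=300):
    """Send one PDF to GROBID with citation consolidation and return the response text.

    requests is imported here so the folder-walking part below runs without it.
    The 300 s timeout is an arbitrary, generous value (an assumption).
    """
    import requests

    with open(pdf_path, "rb") as f:
        response = requests.post(
            url,
            files={"input": f},
            data={"consolidateCitations": "1"},
            timeout=timeout_s,
        )
    response.raise_for_status()
    return response.text


def process_folder(input_dir, output_dir, send):
    """Call send(pdf_path) for every *.pdf in input_dir and save each result.

    Returns the list of PDFs that failed, so they can be retried later.
    """
    os.makedirs(output_dir, exist_ok=True)
    failed = []
    for pdf_path in sorted(glob.glob(os.path.join(input_dir, "*.pdf"))):
        name = os.path.splitext(os.path.basename(pdf_path))[0]
        try:
            xml = send(pdf_path)
        except Exception as exc:  # keep going; one slow PDF should not stop the batch
            print("failed: %s (%s)" % (pdf_path, exc))
            failed.append(pdf_path)
            continue
        out_path = os.path.join(output_dir, name + ".tei.xml")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(xml)
    return failed
```

A call might then look like `process_folder("./screened_PDF", "./tei", lambda p: post_pdf(p, "http://localhost:8070/api/processReferences"))`, where the output directory name is again just an example.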

@kermitt2
Owner

kermitt2 commented Oct 9, 2022

> Adding the email is a bit difficult as I am on an M2 mac and can thus only run grobid in the Docker container, which is hard to edit.

You don't need to edit the container. Simply edit the config file and mount it when launching the container, like this:

docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro  grobid/grobid:0.7.2-SNAPSHOT

(where /home/lopez/grobid/grobid-home/config/grobid.yaml is your edited local config file with your email for Crossref politeness)

> the server log shows that crossref request go through every second or so ... so there might be something more specific going on.

This is probably too slow. A good rate is at least 10 consolidated citations per second, which avoids painful slowness and timeouts when parallelizing processing. If it's just a few hundred PDFs, you can try the public biblio-glutton (which synchronizes itself daily with Crossref) with a low concurrency to avoid putting too heavy a load on this cheap server :D
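
Low concurrency can be sketched with a small thread pool, as below. Here `send` stands for any function that posts one PDF to the GROBID server and returns its result; it is a placeholder, not part of the client. The default of 2 workers is an arbitrary choice, and the function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_low_concurrency(pdf_paths, send, max_workers=2):
    """Run send(pdf_path) on every path with at most max_workers calls in flight.

    Returns a dict mapping each path to its result, or to the exception it raised.
    """
    pdf_paths = list(pdf_paths)  # allow generators; the paths are iterated twice

    def safe_send(pdf_path):
        # Capture errors per file so one failure does not abort the batch.
        try:
            return send(pdf_path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_send, pdf_paths))
    return dict(zip(pdf_paths, results))
```

Keeping max_workers small limits how many consolidation requests reach the shared server at once, at the cost of a longer total run.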
