In general, PDF is a poor format from which to extract an index of page numbers for specific keywords. There are, however, situations in which the text source or other means are not available. So here is a workaround for generating an index on a Windows computer.
Good results can be achieved with the open-source software Xpdf and a Python script. In the first step, the PDF is converted to individual HTML pages using
pdftohtml.exe pdf_file.pdf html
where pdf_file.pdf is the PDF to be analysed and html is the output directory.
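If the conversion should be started from Python rather than typed by hand, a thin wrapper around the command above is enough. This is only a sketch: the function names `pdftohtml_command` and `convert_pdf` are made up for illustration, and it assumes that pdftohtml.exe can be found on the PATH.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_dir, executable="pdftohtml.exe"):
    """Build the argument list for the conversion call shown above."""
    return [executable, str(pdf_path), str(out_dir)]


def convert_pdf(pdf_path, out_dir, executable="pdftohtml.exe"):
    """Run the conversion and raise if the input is missing or the tool fails."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    # check=True turns a non-zero exit code into CalledProcessError.
    subprocess.run(pdftohtml_command(pdf_path, out_dir, executable), check=True)


# Example call (assumes pdf_file.pdf exists in the current folder):
# convert_pdf("pdf_file.pdf", "html")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.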
From the HTML output we generate the index by running the Python script extract_from_html, which assumes the html folder to be present in the same folder and a keywords.txt with the entries to be analysed in the second column. This then generates an HTML file index.html with the index.
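The idea behind this step can be sketched as follows. This is not the actual extract_from_html script but a minimal illustration under several assumptions: keywords.txt is tab-separated, the pages are UTF-8 encoded, each page file carries its page number as the last run of digits in its file name, and matching is a plain case-insensitive substring search. All function names are hypothetical.

```python
import csv
import html
import re
from collections import defaultdict
from pathlib import Path


def read_keywords(path, column=1, delimiter="\t"):
    """Return the non-empty entries of one column (0-based; 1 = second column)."""
    keywords = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) > column and row[column].strip():
                keywords.append(row[column].strip())
    return keywords


def page_number(path):
    """Assumption: the last run of digits in the file name is the page number."""
    digits = re.findall(r"\d+", Path(path).stem)
    return int(digits[-1]) if digits else None


def page_text(path):
    """Strip tags from one HTML page and return whitespace-normalised text."""
    raw = Path(path).read_text(encoding="utf-8", errors="replace")
    raw = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw)
    text = re.sub(r"(?s)<[^>]+>", " ", raw)
    # html.unescape turns entities such as &uuml; back into characters.
    return re.sub(r"\s+", " ", html.unescape(text))


def build_index(html_dir, keywords):
    """Map each keyword to the sorted list of pages on which it occurs."""
    index = defaultdict(set)
    for path in Path(html_dir).glob("*.html"):
        number = page_number(path)
        if number is None:  # files without a number are skipped
            continue
        text = page_text(path).casefold()
        for keyword in keywords:
            if keyword.casefold() in text:
                index[keyword].add(number)
    return {keyword: sorted(pages) for keyword, pages in index.items()}


def write_index(index, out_path="index.html"):
    """Write the index as a simple HTML list, sorted by keyword."""
    items = "\n".join(
        "<li>{}: {}</li>".format(html.escape(keyword), ", ".join(map(str, pages)))
        for keyword, pages in sorted(index.items(), key=lambda kv: kv[0].casefold())
    )
    Path(out_path).write_text(
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Index</title></head>\n<body><ul>\n" + items + "\n</ul></body></html>\n",
        encoding="utf-8",
    )


# Example call:
# write_index(build_index("html", read_keywords("keywords.txt")))
```

Substring matching also counts hits inside longer words and misses terms that are hyphenated across a line break, so the resulting index should be checked by hand.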
There is also the very nice module PDFQuery, with which one could achieve the same result while staying within the Python universe. However, it proved rather complicated to handle German text and sort out the character encoding, as lxml failed to read in the pages.
So for now, there is only the test file pdf_keyword_index.py, which might serve as a basis for future development.