Index pdf document

In general, PDF is a poor format from which to extract an index of page numbers for specific keywords. In some situations, however, the text source or other means are not available. Here is one workaround for generating such an index on a Windows computer.

Using xpdf

Good results can be achieved with the open-source software xpdf and a Python script. In the first step, the PDF is converted to individual HTML pages using

pdftohtml.exe pdf_file.pdf html

where pdf_file.pdf is the pdf to be analysed and html is the output directory.
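The conversion step can also be started from Python, which keeps both steps in one place. The following is a minimal sketch, assuming pdftohtml.exe from xpdf is on the PATH; the function names build_pdftohtml_command and convert_pdf are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_dir, exe="pdftohtml.exe"):
    """Return the argument list for the call shown above: exe, PDF file, output directory."""
    return [exe, str(pdf_path), str(out_dir)]


def convert_pdf(pdf_path, out_dir="html", exe="pdftohtml.exe"):
    """Run pdftohtml on pdf_path and write the HTML pages to out_dir.

    Assumes the executable is installed and reachable under the name
    given in exe; raises CalledProcessError if the tool reports a failure.
    """
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    subprocess.run(build_pdftohtml_command(pdf_path, out_dir, exe), check=True)
```

Calling convert_pdf("pdf_file.pdf") would then correspond to the command line above.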

From the HTML output, the index is generated by running the Python script extract_from_html.py. The script expects the html folder in the same folder, along with a file keywords.txt that lists the entries to be indexed in its second column. It then writes the index to an HTML file, index.html.
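To illustrate the idea behind this step, here is a minimal sketch of a keyword index built from per-page HTML. It is not the code of extract_from_html.py. It assumes that the pages are named page1.html, page2.html, and so on, that keywords.txt is tab-separated, and that matching is a case-insensitive substring search; all function names are hypothetical.

```python
import html
import re
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html_source):
    """Return the text of one page; entities such as &Uuml; are decoded."""
    parser = _TextExtractor()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)


def read_keywords(path):
    """Return the second column of each line (assumption: tab-separated)."""
    keywords = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        columns = line.split("\t")
        if len(columns) >= 2 and columns[1].strip():
            keywords.append(columns[1].strip())
    return keywords


def load_pages(html_dir):
    """Map page number to HTML source (assumption: files named page<N>.html)."""
    pages = {}
    for path in Path(html_dir).glob("page*.html"):
        match = re.fullmatch(r"page(\d+)\.html", path.name)
        if match:
            pages[int(match.group(1))] = path.read_text(encoding="utf-8", errors="replace")
    return pages


def build_index(pages, keywords):
    """Map each keyword to the sorted page numbers whose text contains it."""
    texts = {number: page_text(source).casefold() for number, source in pages.items()}
    return {
        keyword: sorted(n for n, text in texts.items() if keyword.casefold() in text)
        for keyword in keywords
    }


def render_index(index):
    """Render the index as a simple HTML list, keywords sorted alphabetically."""
    rows = []
    for keyword in sorted(index, key=str.casefold):
        numbers = ", ".join(str(n) for n in index[keyword])
        rows.append(f"<li>{html.escape(keyword)}: {numbers}</li>")
    return "<ul>\n" + "\n".join(rows) + "\n</ul>\n"
```

With these helpers, render_index(build_index(load_pages("html"), read_keywords("keywords.txt"))) would produce the body of an index page. A keyword that is split across several HTML elements on a page would be missed by this simple substring search.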

Using PDFQuery

The very nice module PDFQuery could achieve the same result while staying within the Python universe. However, handling German text and sorting out the character encoding turned out to be rather complicated, as lxml failed to read in the pages.

So for now, there is only the test file pdf_keyword_index.py, which might serve as a basis for future development.

scholich/pdf_indexing

Folders and files

Latest commit

History

Repository files navigation

Index pdf document

Using xpdf

Using PDFQuery

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages