A simple way to generate word clouds from PDFs, Word Documents, and text-based files.
Open in Google Colab and save a copy to use.
Note: The Colab notebook runs best with Google Chrome.
- Single file -> Single word cloud
- Multiple files -> Single word cloud
- Multiple files -> Multiple word clouds + Combined word cloud
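The three modes above differ only in how word counts are grouped before drawing. The notebook's internals are not shown here, so the following is a minimal sketch under assumptions: the text has already been extracted from each file, `STOP_WORDS` is a hypothetical stop-word list, and the function names are illustrative.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the notebook's own list is not shown here.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def word_frequencies(text):
    """Count lowercase words in text, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def frequencies_per_file(texts):
    """Map each file name to its own word counts (one word cloud per file)."""
    return {name: word_frequencies(text) for name, text in texts.items()}

def combined_frequencies(texts):
    """Sum the counts across all files (one combined word cloud)."""
    total = Counter()
    for counts in frequencies_per_file(texts).values():
        total.update(counts)
    return total

# Two in-memory "files" whose text has already been extracted.
texts = {
    "a.txt": "The cat and the hat",
    "b.txt": "A cat in a box",
}
per_file = frequencies_per_file(texts)
combined = combined_frequencies(texts)
print(combined.most_common(1))  # [('cat', 2)]
```

A word-to-count mapping like `combined` could then be handed to a drawing library, for example the `wordcloud` package's `WordCloud.generate_from_frequencies` method, assuming that package is what renders the images.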
Instructions (for running in Colab):
- Run the entire file: From the menu, select Runtime, then Run all.
- Click on the Upload Files button to upload your files.
- Set the appropriate parameters by toggling the Yes/No buttons in the "Settings" cell.
- Click Run and wait for the word clouds to be generated. Wait until you see the message ### Done ### in the cell logs, or a word cloud with the title "Combined Word Cloud".
- Click Download to get a zip file of your word cloud images.
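Once the zip file is downloaded, the images can be inspected or unpacked with Python's standard `zipfile` module. This is a sketch only: the archive name in the usage comment and the set of image extensions are assumptions, since the notebook's output names are not listed here.

```python
import zipfile
from pathlib import Path

# Assumed image formats; adjust to whatever the archive actually contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

def is_image(name):
    """Return True if the archive member name has an image extension."""
    return Path(name).suffix.lower() in IMAGE_SUFFIXES

def list_images_in_zip(zip_path):
    """Return the sorted names of image files stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(n for n in archive.namelist() if is_image(n))

def extract_images(zip_path, out_dir):
    """Extract only the image files into out_dir and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if is_image(n)]
        archive.extractall(out_dir, members=names)
    return [out_dir / n for n in names]

# Usage (hypothetical file name):
# extract_images("wordclouds.zip", "wordclouds")
```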
- Text in images cannot be read. Here is a workaround to extract text from images.
- Scanned PDFs (You know a PDF was scanned if you can't select text with your mouse when you open it normally)
- SOTA OCR methods are still not perfect
- OCR text recognition takes longer to run
- Key takeaway: If you can get a machine-generated PDF, use that. Otherwise, tag your scanned PDF files by renaming them to end with _scanned.pdf
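Renaming many scanned files by hand is tedious, so the naming convention above can be applied with a small helper before uploading. This is a sketch: the function names are illustrative, and it assumes the notebook only checks that the file name ends with `_scanned.pdf`.

```python
from pathlib import Path

SCANNED_SUFFIX = "_scanned"

def scanned_name(path):
    """Return the path with its file name changed to end with _scanned.pdf."""
    path = Path(path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {path.name}")
    stem = path.stem
    # Leave the stem alone if the file is already tagged.
    if not stem.endswith(SCANNED_SUFFIX):
        stem += SCANNED_SUFFIX
    # Normalise the extension to lowercase so the name ends with _scanned.pdf.
    return path.with_name(stem + ".pdf")

def tag_as_scanned(path):
    """Rename the file on disk (if needed) and return the new path."""
    path = Path(path)
    new_path = scanned_name(path)
    if new_path != path:
        path.rename(new_path)
    return new_path

print(scanned_name("invoice.pdf").name)  # invoice_scanned.pdf
```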
- Sometimes, the required packages fail to install correctly, leading to an error in the logs that says: ERROR: module 'PIL.Image' has no attribute 'Transpose'. In this scenario, go to Runtime in the menu, and select Restart and run all. This should fix the problem, and you can then go through the steps outlined in the instructions.
- Uploading files using the Firefox browser has been known to go a bit wonky. This is an erratic bug in Colab itself. For the best experience, use Google Chrome.
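To check whether the `PIL.Image` error mentioned above applies to the current session, you can test for the attribute directly in a new cell. `Image.Transpose` was introduced in Pillow 9.1.0; a plausible cause of the error (an assumption, not confirmed by the notebook) is that an older Pillow was already loaded before the newer one was installed. The helper name below is illustrative.

```python
import importlib

def module_has_attr(module_name, attr_name):
    """Return True if the module can be imported and exposes attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or its parent package) is not installed.
        return False
    return hasattr(module, attr_name)

if module_has_attr("PIL.Image", "Transpose"):
    print("PIL.Image.Transpose is available in this session.")
else:
    print("PIL.Image.Transpose is missing (or Pillow is not installed); "
          "try Runtime, then Restart and run all.")
```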