A comprehensive system for analyzing resumes in both PDF and Word document formats, extracting key information, and generating detailed analysis reports.
- Multi-format Support: Process both PDF and Word documents (.pdf, .docx, .doc)
- Intelligent Text Extraction:
  - PDF processing using LlamaParse
  - Word document processing using python-docx
- Comprehensive Analysis:
  - Contact information extraction
  - Technical skills assessment
  - Education history analysis
  - Experience evaluation
  - Project analysis
  - Overall fit scoring
- Output Formats:
  - Detailed Markdown reports
  - CSV format for data analysis
  - Filtered reports for top candidates
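The format support listed above amounts to a routing step keyed on file extension. The following is a minimal sketch of that idea, not code from main.py: `pick_extractor`, `EXTRACTORS`, and the backend labels are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the extraction backend named in the
# feature list. The labels are illustrative, not identifiers from main.py.
EXTRACTORS = {
    ".pdf": "llamaparse",
    ".docx": "python-docx",
    # Assumption: python-docx itself opens .docx files, so legacy .doc files
    # may need converting before they reach this backend.
    ".doc": "python-docx",
}


def pick_extractor(path):
    """Return the backend label for a resume file, or raise on unsupported types."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported resume format: {suffix or '(none)'}") from None
```

Lower-casing the suffix means files such as Resume.PDF are routed the same way as resume.pdf.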
pip install -r requirements.txt
Required packages:
- python-docx
- llama-parse
- pandas
- pydantic
- pydantic-ai
- python-dotenv
- nest-asyncio
- Create a .env file in the root directory
- Add your API keys:
OPENAI_API_KEY=your_openai_api_key
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key
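A quick check that both keys are actually set can save a failed run later. The sketch below only inspects a mapping of environment variables; `missing_api_keys` is a hypothetical helper, not part of the project's scripts.

```python
import os

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

With python-dotenv installed, calling `load_dotenv()` (from the `dotenv` module) before `missing_api_keys()` would load the .env file into `os.environ` first.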
├── main.py # Main resume processing script
├── filter_high_scores.py # Script for filtering top candidates
├── results/ # Output directory (git ignored)
│ ├── resume_analysis_results.md
│ ├── resume_analysis_results.csv
│ └── top_candidates.md
├── CVs/ # Directory containing resumes
│ └── cv_leads/ # Subdirectory for resumes (git ignored)
├── .gitignore # Git exclusion patterns
└── requirements.txt
Note: The results/ and CVs/cv_leads/ directories are excluded from git tracking to avoid committing sensitive data and large files. These directories will be created locally when running the scripts.
Run the main script to process all resumes:
python main.py
This will:
- Process all PDF and Word documents in the specified directory
- Generate detailed analysis for each resume
- Save results in both markdown and CSV formats
- Create a results directory if it doesn't exist
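The discovery and setup steps above could look roughly like the sketch below. It is an assumption-laden illustration rather than the logic in main.py: `find_resumes` and `ensure_results_dir` are hypothetical names, and the recursive search is a guess at how subdirectories such as CVs/cv_leads/ are handled.

```python
from pathlib import Path

# Suffixes taken from the supported formats listed in this README.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".doc"}


def find_resumes(resume_dir):
    """Return supported resume files under resume_dir, sorted for stable ordering."""
    root = Path(resume_dir)
    return sorted(
        p for p in root.rglob("*")  # assumption: search subdirectories too
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def ensure_results_dir(results_dir="results"):
    """Create the output directory if it does not already exist."""
    path = Path(results_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Using `exist_ok=True` makes the directory step safe to repeat on every run.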
After processing resumes, run the filtering script:
python filter_high_scores.py
This will:
- Read the CSV results file
- Filter candidates with scores >= 8.0
- Generate a new markdown file with detailed information about top candidates
- Include a summary of the filtering results
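The filtering step can be pictured as follows. This is a standard-library sketch, not the body of `filter_and_save_high_scores`: the helper name `filter_rows_by_score` and the column name `overall_score` are assumptions, since the actual CSV header is not shown here.

```python
import csv
import io


def filter_rows_by_score(csv_text, min_score=8.0, score_column="overall_score"):
    """Return rows whose score is >= min_score, highest score first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            score = float(row[score_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if score >= min_score:
            rows.append({**row, score_column: score})
    return sorted(rows, key=lambda r: r[score_column], reverse=True)
```

Since pandas is among the required packages, the same filter could also be written as a boolean mask on a DataFrame, for example `df[df["overall_score"] >= 8.0]`, again assuming that column name.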
The system analyzes the following aspects:
- Contact Information:
  - Full name
  - Email address
  - Phone number
- Technical Skills:
  - Python experience and frameworks
  - Other programming languages
  - Django experience
  - SQL proficiency
  - Cloud platform experience (Azure, AWS)
  - GitHub repositories and profiles
- Education:
  - Degrees (Bachelor's, Master's, PhD)
  - Universities and graduation years
  - Awards and certifications
- Experience:
  - Years of relevant experience
  - Data science projects
  - Healthcare industry experience
  - Leadership roles
- Scoring System:
  - Technical expertise (40%)
  - Relevant experience (30%)
  - Education (20%)
  - Leadership potential (10%)
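One plausible reading of the scoring breakdown is a weighted sum of per-category scores. The sketch below assumes each category is scored on a 0-10 scale (in line with the 8.0 threshold used for top candidates); how the scripts actually combine the scores is not stated here, and `overall_score` is a hypothetical helper.

```python
# Weights taken from the scoring breakdown above.
WEIGHTS = {
    "technical": 0.40,
    "experience": 0.30,
    "education": 0.20,
    "leadership": 0.10,
}


def overall_score(sub_scores):
    """Combine per-category scores (assumed 0-10) into one weighted score."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise KeyError(f"Missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
```

For example, a candidate scoring 9 on technical expertise, 8 on experience, 7 on education, and 6 on leadership would come out at 0.4·9 + 0.3·8 + 0.2·7 + 0.1·6 = 8.0 under these assumptions.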
- Markdown Report (resume_analysis_results.md):
  - Detailed analysis for each candidate
  - Formatted sections with emojis
  - Easy to read and share
- CSV File (resume_analysis_results.csv):
  - Structured data format
  - Easy to import into other tools
  - Suitable for further analysis
- Top Candidates Report (top_candidates.md):
  - Filtered view of best candidates
  - Sorted by score
  - Summary statistics
The system includes:
- Retry logic for API calls
- Graceful handling of missing data
- Error reporting for failed processing
- File format validation
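As an illustration of the retry behavior mentioned above, here is a generic sketch using exponential backoff. The project's actual retry policy is not described in this README, so `call_with_retry`, the attempt count, and the delay schedule are all assumptions.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on exception, wait and retry with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. In practice it would be reasonable to catch only the specific exceptions raised by the API client rather than every `Exception`.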
In filter_high_scores.py, modify the min_score parameter:
filter_and_save_high_scores(input_csv, output_md, min_score=7.0) # Change to desired threshold
In both scripts, update the output paths:
markdown_file = 'your/custom/path/results.md'
csv_file = 'your/custom/path/results.csv'
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LlamaParse for PDF processing
- python-docx for Word document processing
- Pandas for data manipulation