Enhance and Debug PDF Data Extraction Program (Python)
$50-100 USD
Paid on delivery
I’m looking for an experienced Python programmer to debug, improve, and optimize a PDF data extraction program that I’ve been developing. The program is designed to extract specific information (e.g., ID Constructie, Suprafata constructie, Nr. CF, Nume Proprietar, and Intravilan status) from Romanian land registry PDFs. While it’s functional in some areas, certain parts do not work as expected, and the code overall needs optimization.
What the Program Does:
Extracts data from PDF files using pdfplumber and regex.
Extracts key fields like:
ID Constructie (Construction IDs)
Suprafata constructie (Construction Area)
Nr. CF (Land Registry Number)
Nume Proprietar (Owner Name)
Intravilan status (whether the land lies within the built-up area: "DA" for yes, "NU" for no)
Categorie de folosinta (Land Usage Category)
The extracted data is saved into an Excel file using pandas and openpyxl.
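For orientation, the overall flow described above could look roughly like the sketch below. It is not the existing program: the function names, the "Fisier" column, and the Nr. CF label pattern are all assumptions made for illustration, and only one field is extracted to keep it short.

```python
import re
from pathlib import Path

# Assumed label layout; real extracts may print the number differently.
NR_CF = re.compile(r"Nr\.?\s*CF\s*[:\-]?\s*(\d+)", re.IGNORECASE)


def read_pdf_text(pdf_path):
    """Concatenate the text of every page; pages without text yield ''."""
    import pdfplumber  # imported lazily so the regex parts work without it

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_record(text, source_name):
    """Build one spreadsheet row (a dict) from the text of one PDF."""
    match = NR_CF.search(text)
    return {"Fisier": source_name, "Nr. CF": match.group(1) if match else None}


def export_folder(folder, output_xlsx):
    """Process every PDF in `folder` and write one row per file to Excel."""
    import pandas as pd

    rows = [
        extract_record(read_pdf_text(path), path.name)
        for path in sorted(Path(folder).glob("*.pdf"))
    ]
    pd.DataFrame(rows).to_excel(output_xlsx, index=False, engine="openpyxl")
```

Keeping the text-to-record step separate from PDF reading means the regex logic can be tested on plain strings, without any PDF files.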
Issues Faced:
ID Construcție & Suprafață Construcție:
These fields are not extracted accurately. The correct logic should be based on the A1.x format for IDs and values following specific patterns. Currently, the function doesn't meet expectations.
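One possible way to pair each A1.x ID with its area is sketched below. The assumption (not confirmed by the actual PDFs) is that each construction entry starts with its ID and that the area appears later in the same entry as a number followed by "mp". The function name and the returned keys are hypothetical.

```python
import re

# IDs in the A1.x format, e.g. A1.1, A1.2
CONSTRUCTION_ID = re.compile(r"\bA1\.\d+\b")
# Assumed area shape: a number (optionally with . or , inside) followed by "mp"
AREA = re.compile(r"(\d+(?:[.,]\d+)?)\s*mp\b", re.IGNORECASE)


def extract_constructions(text):
    """Return one dict per A1.x ID, with the first area found before the next ID."""
    matches = list(CONSTRUCTION_ID.finditer(text))
    results = []
    for index, match in enumerate(matches):
        # The entry for this ID runs until the next ID (or the end of the text).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        area = AREA.search(text, match.end(), end)
        results.append({
            "id": match.group(0),
            # Kept as a raw string: "1.200" may be a thousands separator, not a decimal.
            "suprafata": area.group(1) if area else None,
        })
    return results
```

Limiting the area search to the span between two consecutive IDs prevents an area from one construction being attached to the previous one when a value is missing.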
Inconsistent PDF Formats:
PDFs often vary in structure, especially for key phrases like "Date referitoare la teren" or "Lungime Segmente". Some PDFs lack these sections altogether, causing failures.
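A sketch of section slicing that tolerates missing markers is shown below. The helper names are hypothetical. It returns None when the start marker is absent (so the caller can fall back instead of crashing) and runs to the end of the text when no end marker is found.

```python
import re


def marker_pattern(marker):
    """Literal, case-insensitive pattern that tolerates extra whitespace between words."""
    words = marker.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words), re.IGNORECASE)


def slice_section(text, start_marker, end_markers=()):
    """Return the text after `start_marker` up to the nearest end marker.

    Returns None if the start marker is missing; if no end marker is
    found, the section extends to the end of the text.
    """
    start = marker_pattern(start_marker).search(text)
    if start is None:
        return None
    ends = []
    for end_marker in end_markers:
        found = marker_pattern(end_marker).search(text, start.end())
        if found:
            ends.append(found.start())
    stop = min(ends) if ends else len(text)
    return text[start.end():stop].strip()
```

For example, `slice_section(text, "Date referitoare la teren", ["Lungime Segmente"])` would give the land table when both phrases exist, the rest of the document when only the first does, and None when neither does.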
Fallback Mechanisms:
When anchor text such as "679/2016" is missing, the program should search alternate ranges of the document, but this logic needs fine-tuning.
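The fallback idea could be sketched as an ordered list of candidate ranges that are tried one after another. Everything here is an assumption for illustration: the function names, the 1500-character window, and the choice of "text before the marker", "tail of the document", and "full text" as the ranges.

```python
import re


def candidate_ranges(text, marker="679/2016", window=1500):
    """Yield (name, chunk) pairs from the most specific range to the broadest."""
    position = text.find(marker)
    if position != -1:
        # Preferred range: the text just before the marker.
        yield "before_marker", text[max(0, position - window):position]
    # Fallbacks when the marker is missing or the preferred range has no match.
    yield "tail", text[-window:]
    yield "full_text", text


def search_with_fallback(text, pattern):
    """Return (range_name, first captured group) or (None, None) if nothing matches."""
    for name, chunk in candidate_ranges(text):
        match = pattern.search(chunk)
        if match:
            return name, match.group(1)
    return None, None
```

Returning the name of the range that produced the match makes it easy to log which PDFs needed a fallback, which helps when fine-tuning the ranges.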
Other Enhancements:
General code improvements: Robust error handling, optimized regular expressions, and flexibility to adapt to varied PDF layouts.
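As an illustration of the error-handling point, one option is to run each field extractor in isolation so that a failure in one field leaves the others intact. The names below are hypothetical, and the extractors are assumed to be plain functions that take the PDF text and return a value.

```python
import logging

logger = logging.getLogger("cf_extract")


def run_extractors(text, extractors):
    """Run each extractor on `text`; a failing field becomes None instead of aborting.

    `extractors` maps a field name to a function taking the text.
    Returns (record, errors), where errors maps field name to the error message.
    """
    record, errors = {}, {}
    for field, extractor in extractors.items():
        try:
            record[field] = extractor(text)
        except Exception as exc:  # broad on purpose: one bad field must not stop the file
            logger.warning("Field %r failed: %s", field, exc)
            record[field] = None
            errors[field] = str(exc)
    return record, errors
```

The returned errors can be written to a separate sheet or log so that problem PDFs are easy to find afterwards.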
Project Title:
"Debug and Enhance PDF Data Extraction Program (Python)"
Description:
I have a Python program designed to extract specific data fields from PDFs, such as property documents ("Cărți Funciare"). The program uses libraries like pdfplumber, re, and pandas to process the PDFs and output results into an Excel file. While the core functionality is implemented, there are issues and areas for improvement that need an expert to resolve.
What the Program Does:
The current program extracts:
Nr. CF - Land registry number.
Nume Proprietar - Owner's name(s).
Suprafață Teren - Land area.
ID Construcție & Suprafață Construcție - IDs of constructions and their respective areas.
Intravilan - Status ("DA" or "NU") indicating land classification.
Categorie de Folosință - Category of land usage (e.g., Arabil, Padure, etc.).
The extracted data is then saved into an Excel file using pandas and openpyxl.
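The field names in this post appear both with and without diacritics ("Suprafata" and "Suprafață"), and PDFs can do the same. A minimal sketch of one way to make matching insensitive to that, assuming it is acceptable to search on a diacritic-free copy of the text (the function name is hypothetical):

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks so 'Suprafață' and 'Suprafata' compare equal.

    NFKD splits letters such as ă, ș, ț (comma-below or cedilla forms)
    into a base letter plus a combining mark, which is then dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Because folding can change the length of the string, patterns should be run on the folded text consistently rather than mixing offsets between the folded and original versions.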
What I Need:
Debug and fix the extraction of "ID Construcție" and "Suprafață Construcție". IDs should be accurately matched (e.g., "A1.x" format).
Improve extract_intravilan_status and ensure it searches multiple ranges if one fails.
Enhance program flexibility to handle PDFs with inconsistent or missing sections.
Clean and optimize regular expressions and search logic for better accuracy.
Implement fallback mechanisms for edge cases when specific sections are not found.
Review other functions (like extract_categorie_folosinta and extract_nume_proprietar) and improve efficiency and reliability.
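For the land category and intravilan status, one approach is to parse the rows of the land table directly rather than searching for each value separately. The sketch below assumes (without having seen the files) that each row looks like "row number, category, DA/NU, area", for example "1 arabil DA 1.200". The pattern and function name are hypothetical and are not the existing extract_intravilan_status or extract_categorie_folosinta.

```python
import re

# Assumed row shape: "<nr> <categorie> <DA|NU> <suprafata>", all on one line.
LAND_ROW = re.compile(
    r"^[ \t]*(?P<nr>\d+)[ \t]+(?P<categorie>.+?)[ \t]+"
    r"(?P<intravilan>DA|NU)[ \t]+(?P<suprafata>\d[\d.]*)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_land_rows(section_text):
    """Return one dict per table row that matches the assumed layout."""
    rows = []
    for match in LAND_ROW.finditer(section_text):
        rows.append({
            "nr": int(match["nr"]),
            "categorie": match["categorie"].strip().lower(),
            "intravilan": match["intravilan"].upper(),
            # Raw string: "1.200" may use a dot as a thousands separator.
            "suprafata": match["suprafata"],
        })
    return rows
```

If this returns an empty list for a given range of text, the caller can try the next range, which fits the "search multiple ranges if one fails" requirement.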
Ideal Skills:
- Proficient in Python
- Experience with PDF data extraction
- Strong debugging skills
- Ability to enhance program functionality
- Familiarity with handling varied data formats
Deliverables:
A working Python script with improved functionality.
Debugged and accurate extraction for all required fields (IDs, areas, intravilan status, etc.).
Clear documentation on updates made, especially new logic added.
Program capable of handling varied PDF formats and edge cases.
Project ID: #38897992
About the project
36 freelancers are bidding on average $83 for this job
Top 1% on Freelancer.com. Hi, greetings! ✅ Checked your project details. ✅ Completion time: within the project deadline. We have worked on 900+ projects. I have 6+ years of experience in the same kind of projects. If you are look...
No challenge is too complex for me to overcome, and bringing your PDF data extraction program to its full potential is my top priority. My proficiency in Python, especially with renowned libraries like pdfplumber, re, ...
I can help by leveraging my expertise in Python and document processing, specifically with the PDF-based models I use for conversion, extraction, updating, and merging. My experience with open-source ERP systems allows...
I am confident that I am the ideal candidate to enhance and debug your PDF data extraction program. With my in-depth comprehension of the Python environment, as well as extensive experience working with libraries such...
Hello, I hope everything is well with you. I read your job description and I'm interested in it. I have done similar projects related to PDF data extraction. Let's discuss more details.
Hello! I’m an experienced Python programmer with a strong background in PDF data extraction and optimization. I can help debug, improve, and optimize your existing PDF data extraction program to ensure accuracy, effici...
As a seasoned Python pro, I possess the tenacity and precision that debugging entails. I can troubleshoot the current challenges you're confronting with "ID Construcție" and "Suprafață Construcție", ensuring they are a...
Dear client, I hope you are doing well! I am passionate about data and have a lot of experience in handling it. I have worked on many projects where I build strong and reliable systems to collect, clean, and store data...
Hi there. I have read your job details carefully and I can do this project, "Enhance and Debug PDF Data Extraction Program (Python)". As a full-stack web developer for the last 7 years, I've developed many web applications...
Hi, I am experienced in this and can start right now, but I have a few doubts and questions. Let's have a quick chat and get started. Waiting for your reply!