Website Content Extractor in Python
£20-250 GBP
Paid on delivery
Request for Proposal (RFP)
Project Title
Python Program for Extracting Articles from a Website Site Map into .docx Files
Project Overview
We are seeking a proficient Python developer to create a program that extracts articles from a specific website’s site map (e.g., [login to view URL]) and downloads each article published within a specified time range (e.g., the past 24 hours). Each article should be saved into a separate .docx file, named according to the publication date and time. The final program should be user-friendly and well-documented to allow non-technical users to configure and run the script.
Project Scope and Deliverables
1. Python Script (.py file):
• Develop a Python script that takes a site map URL (e.g., [login to view URL]) as input and extracts all article URLs from the specified page.
• Implement logic to filter articles based on a given time range (e.g., the last 24 hours or between specific start and end times).
• Download each article found in the specified time range and save it in a separate .docx file. The .docx file should include:
• Title of the article (as the document header)
• URL of the source page
• Publication Date and Time
• Author Name (if available)
• Main Body Text
• Filename format: [login to view URL] (e.g., [login to view URL]).
• Implement options to include/exclude metadata (e.g., tags, categories) as needed.
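The extraction and filtering steps in item 1 could be sketched roughly as follows. This is a minimal illustration, not the required design: it assumes the site map is an XML file following the sitemaps.org protocol and that each entry's <lastmod> value can stand in for the publication time. The example.com URLs and the function names are hypothetical, and a real script would first download the XML (for example with requests) instead of using an inline sample.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# Assumption: the site map uses the standard sitemaps.org XML namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data; a real run would download this from the site map URL.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/new</loc><lastmod>2024-05-02T08:30:00Z</lastmod></url>
  <url><loc>https://example.com/old</loc><lastmod>2024-04-20T09:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>
""".strip()


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    stamp = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def parse_sitemap(xml_text):
    """Return (url, timestamp) pairs for entries that have both <loc> and <lastmod>."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc and lastmod:
            entries.append((loc, parse_timestamp(lastmod)))
    return entries


def filter_by_range(entries, start, end):
    """Keep only the entries whose timestamp lies within [start, end]."""
    return [(url, stamp) for url, stamp in entries if start <= stamp <= end]


# Example: articles from the 24 hours before a fixed reference time.
reference = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
recent = filter_by_range(parse_sitemap(SAMPLE), reference - timedelta(hours=24), reference)
```

Entries without a timestamp are skipped here. If the site map turns out to be an HTML page, the dates would have to be read from each article page instead.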
2. Output Files:
• Each article should be saved in a separate .docx file in the specified output directory.
• Store additional metadata or a summary file (e.g., a .csv file listing all downloaded articles with their URLs and publication times) if needed.
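The optional summary file could be produced with Python's standard csv module, as in the sketch below. The column names, the write_summary helper, and the sample row (including its file name) are assumptions made for illustration only; the posting does not prescribe a layout.

```python
import csv
import io

# Hypothetical column layout for the summary file.
FIELDNAMES = ["url", "published", "filename"]


def write_summary(rows, handle):
    """Write one CSV row per downloaded article to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Example with an in-memory buffer; a real run would instead use
# open(path, "w", newline="", encoding="utf-8").
rows = [
    {
        "url": "https://example.com/new",
        "published": "2024-05-02T08:30:00+00:00",
        "filename": "article-1.docx",  # placeholder name, not the required format
    }
]
buffer = io.StringIO()
write_summary(rows, buffer)
summary_lines = buffer.getvalue().splitlines()
```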
3. User Interface & Usability:
• Provide a user-friendly interface or command-line options for configuring parameters such as:
• Site Map URL: Input the URL of the site map page (e.g., [login to view URL]).
• Time Range: Specify a time range for filtering articles (e.g., “last 24 hours” or between YYYY-MM-DD and YYYY-MM-DD).
• Output Directory: Set the destination folder for saving the downloaded .docx files.
• Error handling should be robust, with clear messages for common issues (e.g., “Invalid site map URL” or “No articles found in the specified time range”).
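A command-line interface covering the parameters in item 3 might look like the sketch below, built on the standard argparse module. The option names (--last, --start, --end, --output-dir), the defaults, and the exact error messages are assumptions, and main only validates its input without downloading anything.

```python
import argparse
from datetime import date
from pathlib import Path
from urllib.parse import urlparse


def validate_sitemap_url(url):
    """Reject anything that is not an absolute http(s) URL."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Invalid site map URL: {url!r}")
    return url


def build_parser():
    """Define the command-line options (names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(
        description="Save recent articles listed in a site map as .docx files."
    )
    parser.add_argument("sitemap_url", help="URL of the site map page")
    parser.add_argument("--last", default="24h",
                        help="relative window such as 24h or 2d")
    parser.add_argument("--start", type=date.fromisoformat,
                        help="range start, YYYY-MM-DD")
    parser.add_argument("--end", type=date.fromisoformat,
                        help="range end, YYYY-MM-DD")
    parser.add_argument("--output-dir", type=Path, default=Path("articles"),
                        help="folder for the generated .docx files")
    return parser


def main(argv=None):
    """Parse and validate the arguments; return a process exit code."""
    args = build_parser().parse_args(argv)
    try:
        validate_sitemap_url(args.sitemap_url)
    except ValueError as error:
        print(f"Error: {error}")
        return 1
    if (args.start is None) != (args.end is None):
        print("Error: --start and --end must be given together")
        return 1
    print(f"Would read {args.sitemap_url} and write to {args.output_dir}")
    return 0
```

With this layout, a non-technical user would run something like "python extract_articles.py https://example.com/sitemap.xml --last 2d --output-dir out", where extract_articles.py is a hypothetical script name.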
4. Detailed Documentation:
• Provide a README file with:
• Installation instructions (including dependencies).
• Detailed usage instructions, covering:
• How to set up and run the script.
• How to specify the time range and site map URL.
• Optional configuration settings.
• Troubleshooting guide for common errors.
5. Code Quality:
• The code should be clean, modular, and well-commented, adhering to Python best practices and the PEP8 coding standard.
• Use meaningful variable names and clear function structures.
Technical Requirements
1. Programming Language: Python (Latest stable version).
2. Libraries:
• Suggested libraries include BeautifulSoup, requests, lxml, and python-docx.
• The developer can recommend additional libraries as needed, but must document their usage in a [login to view URL] file.
3. Environment Compatibility: The script should be compatible with Windows and Unix-based systems.
4. Time Range Specification: Implement logic to handle time ranges in hours or days (e.g., articles published within the last 24 hours, or between specific start and end dates).
5. Data Compliance: Ensure the solution adheres to the target website’s Terms of Service and does not violate any legal restrictions.
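For the time range specification in item 4 above, one possible approach is to accept either a relative window or an explicit start and end, and to resolve both to a concrete (start, end) pair. The "24h" / "2d" shorthand and the function names below are assumptions for illustration, not part of the requirements.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_window(spec):
    """Turn a string such as '24h' or '2d' into a timedelta."""
    match = re.fullmatch(r"(\d+)\s*([hd])", spec.strip().lower())
    if not match:
        raise ValueError(f"Unrecognised time window: {spec!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=amount) if unit == "h" else timedelta(days=amount)


def resolve_range(now, window=None, start=None, end=None):
    """Return (start, end) from a relative window or from explicit bounds."""
    if window is not None:
        return now - parse_window(window), now
    if start is None or end is None:
        raise ValueError("Give either a window or both start and end")
    if start > end:
        raise ValueError("The start of the range is after its end")
    return start, end


# Example: the 24 hours before a fixed reference time.
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
last_day = resolve_range(now, window="24h")
```

Using timezone-aware datetimes throughout avoids off-by-hours errors when the website and the user's machine are in different time zones.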
Project Timeline
The project is expected to be completed within 4 weeks from the award date, with the following milestones:
1. Day 1: Initial project setup and development of site map extraction module.
2. Day 2: Implementation of time range filtering and .docx export functionality.
3. Day 3: Internal testing and optimization of the script.
4. Day 4: Delivery of a beta version for client review, followed by final adjustments and delivery of the completed project.
Project Budget
Proposals should include a detailed cost breakdown, including estimated hours for each development phase and any additional costs for third-party libraries or tools.
Submission Requirements
1. Proposal Submission Deadline: [Insert Deadline Date]
2. Proposal Format:
• Company or freelancer profile.
• Portfolio of relevant Python and web scraping projects.
• Proposed approach and implementation strategy.
• Project timeline and cost estimate.
• Contact details.
3. Evaluation Criteria:
• Expertise in Python programming, web scraping, and data extraction.
• Experience in working with .docx file formats.
• Ability to create a user-friendly solution.
• Adherence to the timeline and budget constraints.
Submission Contact
All proposals should be submitted to:
• Contact Name:
• Email Address:
Additional Notes
1. The developer must provide post-delivery support for a period of 2 weeks to address any bugs or issues discovered in the program.
2. All intellectual property rights to the source code and documentation will be transferred to the client upon project completion and final payment.
3. Any changes to the project scope should be mutually agreed upon and documented.
Project ID: #38665755
About the project
63 freelancers are bidding an average of £168 for this job
I am Python developer familiar with web scraping and data extraction, and I can create a user-friendly Python program to extract articles from a specific website's site map and save them as .docx files based on specifi...
Top 1% in Freelancer.com Hi, Greetings! ✅checked your project details: ✅Completed Time: In project deadline We have worked on 900 + Projects. I have 6 + years of the experience in same kind of projects. If you are look...
As the Senior Full Stack Developer with over six years of experience, I have gained extensive knowledge and expertise in a wide range of programming areas that align perfectly with this project's requirements. My deep...
Hello, Sai Wing L. my name is Prayogo, and I have been working as a Full-stack Engineer for 12 years. I have carefully read your job description and feel confident that I can successfully complete your project. I am pr...
Hi Thank you for the opportunity to bid on your project. Bid Proposal: Experienced Python developer ready to kick off your project to create a Python program that extracts and downloads articles from a designated w...
Hi, Hope you are doing well and good. I have already scraped techcrunch site using php and node headless browser. Can we discuss more about the job.
Hi there, How are you? I am a software engineer. Let's do this right now. I am a Python, Django, and Flask developer. I know Python, Beautiful Soup, Scrapy, Playwright, Requests, Urllib3, Selenium, Chromium, and other...
Estimated time line: One day. Budget: 20.0GBP fixed. I will provide the first result within one day. Dear [Client], I have successfully completed similar projects involving Python web scraping and document extracti...
Hi, I have extensive experience in Python development, particularly in web scraping, data extraction, and automation using libraries like BeautifulSoup, requests, lxml, and python-docx. My previous work includes build...
Hello, i have a good experience scraping variety of sites with python, i can start right away, contact me to discuss more project details, thanks
Greetings! With ample of experience in python, I can get your job done efficiently. Kindly confirm if you are good to proceed with this deal. Good day! Thanks Abhishek