madhurimarawat/Web-Scrapper-Functions

Web-Scrapper-Functions

Streamlit-based Python web scraper for text, images, and PDFs. User-friendly interface for quick data extraction from websites. Simplify your web scraping tasks effortlessly.

Website Image

Website Image

Text File Image


Mode of Execution Used: PyCharm and Streamlit

PyCharm

  • Visit the official PyCharm website: PyCharm
  • Download the installer for your platform: Linux, macOS, or Windows.
    Two versions of PyCharm are available:
  1. Community version
  • The Community version is open source and free to use without any paid plan.
  • Its download link is at the bottom of the PyCharm website.
  • After downloading it, follow the setup wizard to complete the installation.

  2. Professional version
  • Its download link is at the top of the website.
  • After downloading it, follow the setup wizard, then either sign up for the free trial or continue with the paid version.

Using PyCharm

  • PyCharm works with virtual environments. A virtual environment holds all the libraries and frameworks a project requires.
  • Each project has its own virtual environment, so requirements such as libraries or frameworks are installed for that project only.
  • Next, create a new file. PyCharm supports several file types, including script files, text files, and Jupyter Notebooks.
  • After choosing the file type and saving the file, run it with the shortcut Shift+F10 (on Windows).
  • Program output appears in the Console, while package installation happens in the Terminal.

Streamlit Server

  • Streamlit is a Python framework for deploying machine learning models and other Python projects with ease, without having to build the frontend by hand.
  • Streamlit is very user-friendly.
  • Streamlit has predefined functions for common frontend components, which can be used directly.
  • To install Streamlit on your system, run this command:
pip install streamlit

Running Project in Streamlit Server

Make sure all dependencies are installed before running the app.

  1. Run the Streamlit app directly with the following command:
streamlit run app.py

where app.py is the name of the file containing the Streamlit code.

By default, Streamlit runs on port 8501.

Several apps can run at the same time; each additional app is served on the next port, such as 8502.

  2. Navigate to the URL http://localhost:8501

You should be able to view the homepage of your app.

🌟 The project and models will change, but this process remains the same for all Streamlit projects.
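
For illustration, a minimal app.py could look like the sketch below. It is not taken from this repository: the page title, the widget labels, and the is_valid_url helper are hypothetical, and the example only checks a URL instead of scraping it.

```python
# app.py: a hypothetical minimal Streamlit page, not the project's actual code.
from urllib.parse import urlparse

try:
    import streamlit as st
except ImportError:  # keeps the helper below usable when Streamlit is not installed
    st = None


def is_valid_url(url: str) -> bool:
    """Return True when the text looks like an absolute http(s) URL."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def main() -> None:
    # Each st.* call below renders one frontend component.
    st.title("Web Scraper")
    url = st.text_input("Website URL")
    if st.button("Check URL"):
        if is_valid_url(url):
            st.success(f"Ready to scrape: {url}")
        else:
            st.error("Please enter a full URL starting with http:// or https://")


if __name__ == "__main__" and st is not None:
    main()
```

Saving this as app.py and running streamlit run app.py should serve the page on the default port.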

Deploying using Streamlit

  1. Visit the official Streamlit website: Streamlit
  2. Create an account using GitHub.
  3. Add all the code to a GitHub repository.
  4. On Streamlit, choose the option for a new deployment.
  5. Enter your GitHub repository name and specify the file name. If the file is named streamlit_app, it is picked up automatically; otherwise you have to specify its path.
  6. Make sure all required libraries are listed in a requirements.txt file in the repository.
  7. Library versions can also be pinned in that file with the == syntax, for example library_name==1.0.
  8. Streamlit installs all dependencies listed in the requirements file, using the pinned versions where they are given.
  9. If everything goes well, the app is deployed on the web, and you can share the link and open the app from any browser.

About Project:

Complete Description about the project and resources used.

  • Embedded Links: Extracts and provides embedded links within a website.

  • Main Website Text Data: Gathers and presents the primary textual content from the main website.

  • Main Website Text Data along with Embedded Links Text Data: Combines main website text data with text data from embedded links.

  • Complete Website Text Data: Retrieves and displays the entire textual content of the website.

  • Extract Text from PDF Link: Retrieves a PDF file from the provided link and extracts its text.

  • Main Website PDF Data along with Embedded Links PDF Data: Merges text extracted from PDF files on the main website with text extracted from PDF files found through embedded links.

  • Complete Website PDF Data: Captures and retrieves the PDF data available across the entire website.

  • Complete Website Text and PDF Data: Presents a comprehensive view of both textual and PDF data from the entire website.

  • Download PDF Files From Main Website: Facilitates the selective download of PDF files from the main website.

  • Download All PDF Files From Website: Enables the bulk download of all PDF files available on the website.

  • Download Image Files From Main Website: Allows the selective download of image files from the main website.

  • Download All Image Files From Website: Supports the bulk download of all image files present on the website.

  • Visit the website: Web Scraper
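
As a rough illustration of the "Embedded Links" and PDF-related options above, the sketch below collects links from an HTML string and separates PDF links from page links. It uses only the standard library (html.parser and urllib.parse) instead of the Requests and Beautiful Soup libraries this project lists, and the names LinkCollector, extract_links, and split_pdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL and skip duplicates.
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def split_pdf_links(links: list[str]) -> tuple[list[str], list[str]]:
    """Separate links whose path ends in .pdf from ordinary page links."""
    pdfs = [link for link in links if urlparse(link).path.lower().endswith(".pdf")]
    pages = [link for link in links if link not in pdfs]
    return pages, pdfs


# Sample HTML stands in for a downloaded page; no network access is needed.
sample_html = """
<a href="/about">About</a>
<a href="docs/report.PDF">Report</a>
<a href="https://other.example/page">External</a>
<a href="/about">About again</a>
"""
links = extract_links(sample_html, "https://example.com/home/")
pages, pdfs = split_pdf_links(links)
print(pages)
print(pdfs)
```

In a real run, the HTML would come from an HTTP response for the main website, and the page links could then be visited in turn to gather text from the embedded links.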


Libraries Used 📚 💻

Short Description about all libraries used.

To install a Python library, use this command:

pip install library_name
  • Streamlit: Simplifies the creation of Python web applications, ideal for data scientists to effortlessly share interactive data visualizations.
  • Requests: Powerful Python module for HTTP requests, offering a streamlined API to integrate web services with ease.
  • Lxml: A powerful and efficient Python library for processing XML and HTML documents, providing a comprehensive toolkit for parsing, validating, and manipulating structured data.
  • Beautiful Soup (bs4): Python library for web scraping tasks, facilitating the extraction of valuable information from HTML and XML documents.
  • PyPDF2: Focuses on handling PDF documents in Python, enabling tasks such as merging, splitting, and text/image extraction from PDF files.
  • io: Standard-library module that provides tools for managing input and output streams, supporting files, strings, and memory buffers. It ships with Python, so no installation is needed.
  • zipfile: Standard-library module for creating, extracting, and manipulating ZIP archives, useful for data storage and transfer tasks. It also ships with Python.
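
To show how io and zipfile can work together, here is a minimal sketch that packs several downloaded files into one in-memory ZIP archive. The function name bundle_as_zip and the sample file names and contents are hypothetical; this is one plausible way to prepare a bulk download, not necessarily how this project does it.

```python
import io
import zipfile


def bundle_as_zip(files: dict[str, bytes]) -> bytes:
    """Pack {file name: content} pairs into a ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    # The archive is finalized once the with-block exits.
    return buffer.getvalue()


# Placeholder contents stand in for scraped text and a downloaded image.
zip_bytes = bundle_as_zip(
    {
        "page.txt": b"scraped text",
        "images/logo.png": b"\x89PNG placeholder",
    }
)

# Reading the archive back confirms what was stored.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    text = archive.read("page.txt")
print(names)
```

Because the result is plain bytes, it could be passed as the data argument of Streamlit's st.download_button so the user receives a single file.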

Web Scraper Limitations

The web app has a limit of 100MB. Once this limit is reached, the app will not be able to scrape additional content. If you need to scrape more content, please run the app locally.


Thanks for Visiting 😄