This project is an Instagram crawler that lets users scrape data from Instagram locations and posts. It uses Selenium with multithreading for web automation and BeautifulSoup for HTML parsing. The crawler can extract information such as post counts, authors, and content from specified Instagram locations.
- Scrapes Instagram location data based on search queries.
- Extracts post information including author, creation date, and content.
- Saves the extracted data in JSON format.
- Supports cookie management for login persistence.
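
As an illustration of the JSON export listed above, here is a minimal sketch using only the standard library. The `Post` class, its field names, and `save_posts` are hypothetical names chosen for this example; the crawler's actual schema and output layout may differ.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Post:
    # Hypothetical field names; the crawler's real schema may differ.
    author: str
    created_at: str  # timestamp kept as a plain string, e.g. ISO 8601
    content: str


def save_posts(posts, path):
    """Write a list of Post records to a UTF-8 encoded JSON file."""
    records = [asdict(post) for post in posts]
    # ensure_ascii=False keeps non-ASCII captions readable in the file.
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records
```

Calling `save_posts([Post("alice", "2024-01-01T00:00:00", "hello")], "posts.json")` would write a one-element JSON array to `posts.json`.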
- Python 3.x
- Selenium
- BeautifulSoup4
- WebDriver Manager
- Fake User Agent
- Other dependencies specified in requirements.txt
- Clone the repository:

      git clone https://github.com/yourusername/instagram-crawler.git
      cd instagram-crawler

- Install the required packages:

      pip install -r requirements.txt

- The project uses WebDriver Manager to automatically handle the WebDriver for your browser, so no manual installation is required.
- Run the script: You can run the script directly from the command line. Make sure to provide your Instagram username and password:

      python main.py -u your_username -p your_password

- Search query: You can customize the search query with the -q option:

      python main.py -u your_username -p your_password -q "your_search_query_here"

- Indexes to extract: You can specify which indexes to extract with the -i option:

      python main.py -u your_username -p your_password -i 0 1 2 3 4 5 6

- Follow the prompts in the console to log in to Instagram if required.
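
To make the command-line options above concrete, the following sketch shows one way such flags could be declared with `argparse`. It is not taken from `main.py`: the long option names (`--username`, `--password`, `--query`, `--indexes`), the help texts, and the `None` defaults are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the -u, -p, -q and -i flags (assumed layout)."""
    parser = argparse.ArgumentParser(description="Instagram location crawler")
    # Long option names below are hypothetical; only the short flags
    # appear in the usage instructions.
    parser.add_argument("-u", "--username", required=True, help="Instagram username")
    parser.add_argument("-p", "--password", required=True, help="Instagram password")
    parser.add_argument("-q", "--query", default=None, help="location search query")
    # nargs="+" collects one or more space-separated integers into a list.
    parser.add_argument("-i", "--indexes", nargs="+", type=int, default=None,
                        help="indexes to extract")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-u", "your_username", "-p", "your_password", "-q", "taipei", "-i", "0", "1", "2"]
)
print(args.query)    # taipei
print(args.indexes)  # [0, 1, 2]
```

Using `nargs="+"` with `type=int` is what allows `-i 0 1 2 3 4 5 6` to arrive as a list of integers rather than a single string.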
- Cookie Management: The crawler saves cookies to maintain login sessions. You can specify your Instagram username and password in the manual_login_and_save method in cookie_manager.py.
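
The cookie persistence described above could be sketched as follows. This is not the code in `cookie_manager.py`: the `save_cookies` and `load_cookies` helpers are hypothetical. The only Selenium calls assumed are the WebDriver methods `get_cookies()` and `add_cookie()`, and the `driver` argument is expected to be an already created WebDriver instance.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Dump the current browser session's cookies to a JSON file."""
    cookies = driver.get_cookies()  # Selenium returns a list of dicts
    Path(path).write_text(json.dumps(cookies), encoding="utf-8")
    return len(cookies)


def load_cookies(driver, path):
    """Restore saved cookies; return False if no cookie file exists yet."""
    cookie_file = Path(path)
    if not cookie_file.exists():
        return False
    for cookie in json.loads(cookie_file.read_text(encoding="utf-8")):
        driver.add_cookie(cookie)
    return True
```

Selenium only accepts cookies for the domain the browser is currently on, so the driver should open an Instagram page before `load_cookies` is called and reload the page afterwards. When `load_cookies` returns `False`, a manual login followed by `save_cookies` would create the file for later runs.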
Contributions are welcome! If you have suggestions for improvements or new features, feel free to open an issue or submit a pull request.
- Selenium - For web automation.
- BeautifulSoup - For HTML parsing.
- WebDriver Manager - For managing browser drivers.
- Fake User Agent - For generating random user agents.