🐞 Scrapy-based crawlers for Taiwanese news, covering 10 media outlets:
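- 中國時報 (China Times)
- 中央社 (Central News Agency, CNA)
- 華視 (Chinese Television System, CTS)
- 東森新聞雲 (ETtoday)
- 自由時報 (Liberty Times)
- 壹蘋新聞網 (Next Apple News, formerly 蘋果日報 / Apple Daily)
- 公視 (Public Television Service, PTS)
- 三立 (Sanlih E-Television, SETN)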
- TVBS
- UDN
$ git clone https://github.com/cool9203/Taiwan-news-crawlers.git
$ cd Taiwan-news-crawlers
$ pip install -r requirements.txt
$ scrapy crawl nextapple -o nextapple_news.json
- Python 3.7+
- Scrapy >= 1.3.0, <= 2.7.0
- Twisted >= 16.6.0, <= 22.8.0
- isort
- flake8
- black
# Basic usage
scrapy crawl <spider> -o <output_name>
# For spiders that can crawl a specified date
# Example: crawl only 2022-10-26
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-26 -a end_date=2022-10-26
# For spiders that can crawl past dates
# Example: if today is 2022-10-27,
# this crawls 2022-10-25 through 2022-10-27
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-25
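The date arguments above can be read as an inclusive range. The following is a minimal sketch of that reading, not the spiders' actual code: `resolve_date_range` is a hypothetical helper, and it assumes that an omitted `end_date` falls back to today, as the last example suggests.

```python
from datetime import date, timedelta
from typing import List, Optional


def resolve_date_range(start_date: str,
                       end_date: Optional[str] = None,
                       today: Optional[date] = None) -> List[str]:
    """Return every ISO date from start_date to end_date, inclusive.

    Hypothetical helper: assumes a missing end_date means "today".
    The `today` parameter only exists to make the example reproducible.
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date) if end_date else (today or date.today())
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    days = (end - start).days
    return [(start + timedelta(days=offset)).isoformat() for offset in range(days + 1)]


# A single specified day: start_date and end_date are the same.
print(resolve_date_range("2022-10-26", "2022-10-26"))
# Past days up to "today", pinned to 2022-10-27 to mirror the example above.
print(resolve_date_range("2022-10-25", today=date(2022, 10, 27)))
```

Each spider may interpret the range differently in practice, so treat this only as a way to reason about which days a given pair of arguments is expected to cover.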
Spider name | Rewritten and working | Supports a specified date | Supports past dates | Keywords (tags) | Notes |
---|---|---|---|---|---|
china | ✔️ | ❌ | ❌ | ✔️ | |
cna | ✔️ | ❌ | ❌ | ✅ | keywords are not always crawled |
cts | ✔️ | ✔️ | ✔️ | ✔️ | always crawls yesterday's news |
ettoday | ✔️ | ✔️ | ✔️ | ✔️ | |
liberty | ✔️ | ❌ | ❌ | ✔️ | |
nextapple (formerly apple) | ✔️ | ❌ | ✔️ | ✔️ | |
pts | ✔️ | ❌ | ❌ | ✔️ | |
setn | ✔️ | ❌ | ❌ | ✔️ | |
tvbs | ✔️ | ✔️ | ✔️ | ✔️ | |
udn | ✔️ | ❌ | ✔️ | ✔️ | |
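Because date support differs per spider, it may help to check the arguments before launching a crawl. The sketch below copies the two date columns of the table into a dictionary and builds the matching command line. `SPIDER_SUPPORT` and `build_crawl_command` are hypothetical names, not part of this project, and the rule that `end_date` requires `start_date` is an assumption.

```python
from typing import List, Optional

# Copied from the table above: (supports a specified date, supports past dates)
SPIDER_SUPPORT = {
    "china": (False, False),
    "cna": (False, False),
    "cts": (True, True),
    "ettoday": (True, True),
    "liberty": (False, False),
    "nextapple": (False, True),
    "pts": (False, False),
    "setn": (False, False),
    "tvbs": (True, True),
    "udn": (False, True),
}


def build_crawl_command(spider: str, output: str,
                        start_date: Optional[str] = None,
                        end_date: Optional[str] = None) -> List[str]:
    """Build a `scrapy crawl` argument list, rejecting unsupported date options."""
    if spider not in SPIDER_SUPPORT:
        raise ValueError(f"unknown spider: {spider}")
    specified_date, past_dates = SPIDER_SUPPORT[spider]
    if end_date is not None and start_date is None:
        # Assumption: end_date is only meaningful together with start_date.
        raise ValueError("end_date requires start_date")
    if end_date is not None and not specified_date:
        raise ValueError(f"{spider} cannot crawl a specified date")
    if start_date is not None and end_date is None and not past_dates:
        raise ValueError(f"{spider} cannot crawl past dates")
    command = ["scrapy", "crawl", spider, "-o", output]
    if start_date is not None:
        command += ["-a", f"start_date={start_date}"]
    if end_date is not None:
        command += ["-a", f"end_date={end_date}"]
    return command


print(" ".join(build_crawl_command("tvbs", "tvbs_news.json", "2022-10-26", "2022-10-26")))
print(" ".join(build_crawl_command("udn", "udn_news.json", "2022-10-25")))
```

The returned list can be passed to `subprocess.run` from the project directory. Keeping the capabilities in one dictionary means the check only needs updating in one place when a spider gains date support.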
Key | Value |
---|---|
website | the publisher |
url | the URL of the original article |
title | the news title |
content | the news content |
category | the news category |
description | the news description |
key_word | the news keywords (tags) |
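As a rough illustration of working with the output, the sketch below parses a JSON array of items carrying the keys listed above and counts articles per category. The sample data is made up, the `summarize` helper is hypothetical, and treating `key_word` as a plain string is an assumption; real output may differ.

```python
import json
from collections import Counter

# The keys listed in the table above.
EXPECTED_KEYS = {"website", "url", "title", "content",
                 "category", "description", "key_word"}

# Made-up sample standing in for the contents of <output_name>.
SAMPLE_OUTPUT = """
[
  {"website": "example publisher", "url": "https://example.com/news/1",
   "title": "Example title A", "content": "Example body A",
   "category": "politics", "description": "Example summary A",
   "key_word": "example"},
  {"website": "example publisher", "url": "https://example.com/news/2",
   "title": "Example title B", "content": "Example body B",
   "category": "sports", "description": "Example summary B",
   "key_word": "example"},
  {"website": "example publisher", "url": "https://example.com/news/3",
   "title": "Example title C", "content": "Example body C",
   "category": "politics", "description": "Example summary C",
   "key_word": "example"}
]
"""


def summarize(items):
    """Check that every item has the expected keys, then count items per category."""
    for item in items:
        missing = EXPECTED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('url')} is missing keys: {sorted(missing)}")
    return Counter(item["category"] for item in items)


items = json.loads(SAMPLE_OUTPUT)
counts = summarize(items)
for category, total in counts.most_common():
    print(category, total)
```

For a real crawl, replace `json.loads(SAMPLE_OUTPUT)` with `json.load` on the opened output file.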
The MIT License