Taiwan-news-crawlers

🐞 Scrapy-based crawlers for Taiwanese news, covering 10 media outlets:

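China Times (中國時報)
Central News Agency, CNA (中央社)
Chinese Television System, CTS (華視)
ETtoday (東森新聞雲)
Liberty Times (自由時報)
Next Apple News (壹蘋新聞網), formerly Apple Daily (蘋果日報)
Public Television Service, PTS (公視)
SET News (三立)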
TVBS
UDN

Getting Started

$ git clone https://github.com/cool9203/Taiwan-news-crawlers.git
$ cd Taiwan-news-crawlers
$ pip install -r requirements.txt
$ scrapy crawl nextapple -o nextapple_news.json

Prerequisites

Python 3.7+
Scrapy >= 1.3.0, <= 2.7.0
Twisted >= 16.6.0, <= 22.8.0
isort
flake8
black

Usage

# Basic usage
scrapy crawl <spider> -o <output_name>

# If the spider can crawl a specific date
# Example: crawl only 2022-10-26
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-26 -a end_date=2022-10-26

# If the spider can crawl past dates
# Example: if today is 2022-10-27, this crawls 2022-10-25 through 2022-10-27
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-25
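
The date semantics in the comments above can be sketched in plain Python. This is only an illustration, not the project's actual code: the helper name `resolve_date_range` and its `today` parameter are hypothetical, and it assumes that a missing `start_date` or `end_date` falls back to the current day.

```python
from datetime import date, timedelta
from typing import List, Optional


def resolve_date_range(
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    today: Optional[date] = None,
) -> List[str]:
    """Return every date from start_date to end_date (inclusive) as YYYY-MM-DD.

    Assumption: a missing bound defaults to `today`, matching the usage
    example where only start_date is given.
    """
    today = today or date.today()
    start = date.fromisoformat(start_date) if start_date else today
    end = date.fromisoformat(end_date) if end_date else today
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    days = (end - start).days
    return [(start + timedelta(days=offset)).isoformat() for offset in range(days + 1)]
```

With `today` fixed to 2022-10-27, passing only `start_date="2022-10-25"` yields the three days from 2022-10-25 through 2022-10-27, as in the second example above.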

Available spiders (all 10)

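| Spider name | Rewritten and working | Can crawl a specific date | Can crawl past dates | Keywords (tags) | Note |
| --- | --- | --- | --- | --- | --- |
| china | ✔️ | ❌ | ❌ | ✔️ | |
| cna | ✔️ | ❌ | ❌ | ✅ | Keywords are not always crawled |
| cts | ✔️ | ✔️ | ✔️ | ✔️ | Always crawls yesterday's news |
| ettoday | ✔️ | ✔️ | ✔️ | ✔️ | |
| liberty | ✔️ | ❌ | ❌ | ✔️ | |
| nextapple (formerly apple) | ✔️ | ❌ | ✔️ | ✔️ | |
| pts | ✔️ | ❌ | ❌ | ✔️ | |
| setn | ✔️ | ❌ | ❌ | ✔️ | |
| tvbs | ✔️ | ✔️ | ✔️ | ✔️ | |
| udn | ✔️ | ❌ | ✔️ | ✔️ | |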
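
As a rough sketch, the table can be turned into a small lookup that refuses date arguments a spider does not support. The names `SPIDER_CAPABILITIES` and `build_crawl_command` are hypothetical and not part of this repository. The sketch also assumes that "specific date" means the spider accepts both `start_date` and `end_date`, while "past dates" means it accepts `start_date` alone.

```python
from typing import Dict, List, Optional, Tuple

# (can crawl a specific date, can crawl past dates), copied from the table.
SPIDER_CAPABILITIES: Dict[str, Tuple[bool, bool]] = {
    "china": (False, False),
    "cna": (False, False),
    "cts": (True, True),
    "ettoday": (True, True),
    "liberty": (False, False),
    "nextapple": (False, True),
    "pts": (False, False),
    "setn": (False, False),
    "tvbs": (True, True),
    "udn": (False, True),
}


def build_crawl_command(
    spider: str,
    output: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> List[str]:
    """Build the `scrapy crawl` argument list, rejecting unsupported dates."""
    if spider not in SPIDER_CAPABILITIES:
        raise ValueError(f"unknown spider: {spider}")
    specific_date, past_dates = SPIDER_CAPABILITIES[spider]
    command = ["scrapy", "crawl", spider, "-o", output]
    if start_date is not None:
        if not (specific_date or past_dates):
            raise ValueError(f"{spider} cannot crawl past dates")
        command += ["-a", f"start_date={start_date}"]
    if end_date is not None:
        if not specific_date:
            raise ValueError(f"{spider} cannot crawl a specific date")
        command += ["-a", f"end_date={end_date}"]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the project root, which is equivalent to typing the command in a shell.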

Output

| Key | Value |
| --- | --- |
| website | The publisher |
| url | URL of the original article |
| title | The news title |
| content | The news content |
| category | The news category |
| description | The news description |
| key_word | The keywords (tags) of the news |
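
A quick way to sanity-check a crawl result is to load the JSON output and confirm that each item carries the keys listed above. The sketch below is an assumption-laden example, not part of the project: `load_items` and `missing_keys` are hypothetical helpers, and it assumes the file written by `-o <name>.json` contains a single JSON array of items.

```python
import json
from typing import Any, Dict, List

# Keys documented in the Output table.
EXPECTED_KEYS = (
    "website",
    "url",
    "title",
    "content",
    "category",
    "description",
    "key_word",
)


def load_items(text: str) -> List[Dict[str, Any]]:
    """Parse the contents of an output file, assumed to be one JSON array."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of news items")
    return items


def missing_keys(item: Dict[str, Any]) -> List[str]:
    """Return the documented keys that are absent from one item."""
    return [key for key in EXPECTED_KEYS if key not in item]
```

For example, reading a file with `load_items(open("tvbs.json", encoding="utf-8").read())` and calling `missing_keys` on each item flags incomplete records, such as a `cna` item without `key_word`.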

License

The MIT License
