cool9203/Taiwan-news-crawlers

Taiwan-news-crawlers

🐞 Scrapy-based crawlers for Taiwanese news, covering 10 media outlets:

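  1. 中國時報 (China Times)
  2. 中央社 (Central News Agency, CNA)
  3. 華視 (Chinese Television System, CTS)
  4. 東森新聞雲 (ETtoday)
  5. 自由時報 (Liberty Times)
  6. 壹蘋新聞網 (Next Apple News, formerly Apple Daily)
  7. 公視 (Public Television Service, PTS)
  8. 三立 (SET News, SETN)
  9. TVBS
  10. UDN (United Daily News)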

Getting Started

$ git clone https://github.com/cool9203/Taiwan-news-crawlers.git
$ cd Taiwan-news-crawlers
$ pip install -r requirements.txt
$ scrapy crawl nextapple -o nextapple_news.json

Prerequisites

  • Python 3.7+
  • Scrapy >= 1.3.0, <= 2.7.0
  • Twisted >= 16.6.0, <= 22.8.0
  • isort
  • flake8
  • black

Usage

# Basic usage
scrapy crawl <spider> -o <output_name>

# If the spider can crawl a specific day
# Example: crawl only 2022-10-26
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-26 -a end_date=2022-10-26

# If the spider can crawl past days
# Example: today is 2022-10-27
# This crawls 2022-10-25 through 2022-10-27
scrapy crawl <spider> -o <output_name> -a start_date=2022-10-25
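
The date arguments above imply an inclusive range whose end defaults to today when `end_date` is omitted. A minimal sketch of that rule follows. The helper `crawl_dates` and its `today` parameter are hypothetical names used for illustration only, and the spiders in this repository may resolve dates differently.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional


def crawl_dates(
    start_date: str,
    end_date: Optional[str] = None,
    today: Optional[date] = None,
) -> List[date]:
    """Return every day from start_date to end_date, both inclusive.

    Hypothetical helper: when end_date is omitted, the range runs up to
    `today` (the current date unless one is passed in explicitly).
    """
    first = datetime.strptime(start_date, "%Y-%m-%d").date()
    if end_date is not None:
        last = datetime.strptime(end_date, "%Y-%m-%d").date()
    else:
        last = today or date.today()
    if last < first:
        raise ValueError("end_date must not be earlier than start_date")
    # One entry per day, including both endpoints.
    return [first + timedelta(days=n) for n in range((last - first).days + 1)]
```

Passing the same value for `start_date` and `end_date` yields a single day, which matches the "specific day" example.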

Available spiders (all 10)

Spider name | Rewrite finished and can crawl | Can crawl a specific day | Can crawl past days | Keywords (tags) | Note
china ✔️ ✔️
cna ✔️ keywords are not always crawled
cts ✔️ ✔️ ✔️ ✔️ always crawls yesterday's news
ettoday ✔️ ✔️ ✔️ ✔️
liberty ✔️ ✔️
nextapple (formerly apple) ✔️ ✔️ ✔️
pts ✔️ ✔️
setn ✔️ ✔️
tvbs ✔️ ✔️ ✔️ ✔️
udn ✔️ ✔️ ✔️
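
To run every spider in the table in turn, a small driver can build the same `scrapy crawl` commands shown under Usage. This is only a sketch: `build_crawl_command` and `crawl_all` are hypothetical helpers, the one-file-per-spider naming is an assumption, and the script is expected to be run from the project root with `scrapy` on the PATH.

```python
import subprocess
from typing import List, Optional

# Spider names as listed in the table above.
SPIDERS = [
    "china", "cna", "cts", "ettoday", "liberty",
    "nextapple", "pts", "setn", "tvbs", "udn",
]


def build_crawl_command(
    spider: str,
    output: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one `scrapy crawl` invocation."""
    command = ["scrapy", "crawl", spider, "-o", output]
    # Date arguments are only meaningful for spiders that support them.
    if start_date:
        command += ["-a", f"start_date={start_date}"]
    if end_date:
        command += ["-a", f"end_date={end_date}"]
    return command


def crawl_all(output_dir: str = ".") -> None:
    """Run each spider once, writing <output_dir>/<spider>.json (assumed naming)."""
    for spider in SPIDERS:
        command = build_crawl_command(spider, f"{output_dir}/{spider}.json")
        # check=False so one failing spider does not stop the rest.
        subprocess.run(command, check=False)
```

Calling `crawl_all()` starts the crawls. Nothing runs on import, so the command builder can be reused or inspected on its own.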

Output

Key | Value
website | the publisher
url | the URL of the original article
title | the news title
content | the news content
category | the news category
description | the news description
key_word | the keywords of the news
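
When the output file has a `.json` extension, Scrapy writes the items as a single JSON array, so the result can be loaded and checked against the keys listed above. The sketch below assumes that format. `load_items` and `missing_keys` are hypothetical helpers, and since not every spider crawls keywords, an absent `key_word` may be expected for some items.

```python
import json
from typing import Dict, List

# Keys documented in the Output table.
EXPECTED_KEYS = [
    "website", "url", "title", "content",
    "category", "description", "key_word",
]


def load_items(path: str) -> List[Dict]:
    """Load a JSON array of news items written by `scrapy crawl ... -o <file>.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_keys(item: Dict) -> List[str]:
    """Return the documented keys that are absent from one item."""
    return [key for key in EXPECTED_KEYS if key not in item]
```

A quick check such as `[missing_keys(item) for item in load_items("<output_name>")]` shows which fields a given spider did not fill in.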

License

The MIT License
