Node.js version: https://github.com/KORINZ/nhk-news-scraper-js
This project is a Python script for scraping news articles from NHK News Web Easy, a website that provides news articles written in simpler Japanese, suitable for language learners. The script extracts the article's URL, title, content, and essential vocabulary along with their furigana (hiragana reading) and generates a quiz for students based on the scraped article.
See the `.txt` file in the repository for an example output.
- Extract a random news article from NHK News Web Easy
- Save article details (URL, date, title, content) and featured vocabularies (with furigana) in a text file
- Generate a daily quiz for students based on the scraped article
- Send customized quizzes, messages, or stickers to LINE from a Python GUI
- Automatically receive (via Google Apps Script) and evaluate answers and upload them to Google Sheets (via Python)
- Check sentiment scores for the news article
- Translate news articles and vocabulary into other languages via the DeepL API from a command-line interface
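As a rough illustration of the furigana extraction mentioned above, the sketch below pulls (word, reading) pairs out of HTML ruby markup. It assumes readings are marked up with standard `<ruby>` and `<rt>` elements; the sample HTML and the names `RubyParser` and `extract_furigana` are hypothetical and not taken from this repository. The project itself lists BeautifulSoup4 and Selenium as dependencies; the standard-library `html.parser` is used here only to keep the example self-contained.

```python
from html.parser import HTMLParser


class RubyParser(HTMLParser):
    """Collect (base text, furigana) pairs from <ruby>..<rt>..</rt></ruby>."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (word, reading) tuples
        self._in_ruby = False
        self._in_rt = False
        self._base = []       # text fragments of the current base word
        self._reading = []    # text fragments of the current reading

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby = True
            self._base, self._reading = [], []
        elif tag == "rt" and self._in_ruby:
            self._in_rt = True

    def handle_endtag(self, tag):
        if tag == "rt":
            self._in_rt = False
        elif tag == "ruby" and self._in_ruby:
            self.pairs.append(("".join(self._base), "".join(self._reading)))
            self._in_ruby = False

    def handle_data(self, data):
        # Text outside <ruby> is plain article text and is ignored here.
        if not self._in_ruby:
            return
        (self._reading if self._in_rt else self._base).append(data)


def extract_furigana(html):
    """Return a list of (word, reading) tuples found in the given HTML."""
    parser = RubyParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical sample sentence, not copied from a real article.
sample = "<p><ruby>台風<rt>たいふう</rt></ruby>が<ruby>近<rt>ちか</rt></ruby>づいています。</p>"
print(extract_furigana(sample))
```

This sketch does not handle `<rp>` fallback parentheses; if the page includes them, their text would need to be skipped as well.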
Tested on Python 3.11 with Windows 11, WSL (Ubuntu 20.04), and macOS Ventura.
Required:
chardet
BeautifulSoup4
Selenium
webdriver_manager
requests
line-bot-sdk
customtkinter
Optional (check_grade_book.py):
pandas
gspread
tabulate
Optional (check_sentiment.py):
transformers
scipy
torch
torchvision
torchaudio
fugashi[unidic]
ipadic
Optional (translate.py):
deepl
Note: currently, `fugashi` will not work on Python downloaded from the Microsoft Store. You will need to install Python from the official website if you want to use sentiment analysis.
- Sign up for a LINE official account.
- Get your own `CHANNEL_ACCESS_TOKEN` (チャネルアクセストークン) and `USER_ID` (あなたのユーザーID) from the LINE Developers Messaging API settings.
- For macOS users, MeCab must be installed if you want to use sentiment analysis:
brew install mecab
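To show how the `CHANNEL_ACCESS_TOKEN` and `USER_ID` obtained above are typically used, here is a minimal sketch that builds (but does not send) a push-message request for the LINE Messaging API. The project depends on `line-bot-sdk`, which wraps this endpoint; the standard-library version below is only an illustration. The function name `build_push_request` and reading the credentials from environment variables are assumptions, not how this repository stores them.

```python
import json
import os
import urllib.request

# Push-message endpoint of the LINE Messaging API.
LINE_PUSH_URL = "https://api.line.me/v2/bot/message/push"


def build_push_request(token, user_id, text):
    """Build a POST request that pushes one text message to one user."""
    payload = {"to": user_id, "messages": [{"type": "text", "text": text}]}
    return urllib.request.Request(
        LINE_PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Assumption: credentials come from environment variables, with
# placeholder values as a fallback so the sketch runs without them.
token = os.environ.get("CHANNEL_ACCESS_TOKEN", "dummy-token")
user_id = os.environ.get("USER_ID", "dummy-user-id")

req = build_push_request(token, user_id, "今日のクイズです")
print(req.get_method(), req.full_url)
# To actually send the message, pass req to urllib.request.urlopen()
# with real credentials.
```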
- Clone this repository:
git clone https://github.com/KORINZ/nhk_news_web_easy_scraper.git
- Install the required packages listed in the dependencies (make sure you are inside the cloned repository folder):
pip install -r requirements.txt
- To run GUI:
python customtkinter_GUI.py
- To run on the terminal:
python main.py
- The script will generate a text file, `news_article.txt`, containing the article's URL, date, title, content, and essential vocabulary (with furigana and definitions) from a random news article.
- Text files for quizzes and logging will also be generated.
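The quiz format this script produces is not described here, so the following is only a hypothetical sketch of how a vocabulary reading quiz could be built from scraped (word, furigana) pairs. The function `make_reading_quiz`, the question layout, and the sample vocabulary are all assumptions for illustration.

```python
import random


def make_reading_quiz(vocab, num_choices=4, seed=None):
    """Build one multiple-choice reading question per word.

    vocab maps each word to its furigana reading. The wrong choices
    are readings of other words in the same vocabulary list.
    """
    rng = random.Random(seed)
    readings = list(vocab.values())
    questions = []
    for word, reading in vocab.items():
        distractors = [r for r in readings if r != reading]
        choices = rng.sample(distractors, min(num_choices - 1, len(distractors)))
        choices.append(reading)
        rng.shuffle(choices)
        questions.append(
            {"word": word, "choices": choices, "answer": choices.index(reading)}
        )
    return questions


# Hypothetical vocabulary, not taken from a real article.
vocab = {
    "台風": "たいふう",
    "地震": "じしん",
    "天気": "てんき",
    "経済": "けいざい",
    "選挙": "せんきょ",
}

for q in make_reading_quiz(vocab, seed=0):
    options = "  ".join(f"{i + 1}. {c}" for i, c in enumerate(q["choices"]))
    print(f"{q['word']}: {options}")
```

Passing a fixed `seed` makes the quiz reproducible, which can help when the same quiz needs to be regenerated for grading.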
- Install Japanese fonts:
sudo apt update
sudo apt install -y fonts-ipafont
- Install tkinter; replace `xx` with your Python version:
sudo apt-get install python3.xx-tk
- Install support for Linux GUI apps, see:
https://learn.microsoft.com/en-us/windows/wsl/tutorials/gui-apps
- Click on `クイズ作成` (Create Quiz) to scrape a random news article and generate quizzes.
- Click on `LINE機密情報入力` (Enter LINE credentials) inside the `設定` (Settings) tab to fill in your `CHANNEL_ACCESS_TOKEN` (チャネルアクセストークン) and `USER_ID` (あなたのユーザーID).
- Click on `LINEに発信` (Send to LINE) to send the quiz.
- pending
- A Google Cloud Platform account is required (https://console.cloud.google.com/).
- pending
- pending
- Create a database to store all past news articles, vocabularies, and quizzes
- Improve the formatting of the output text file
- Add translation to quiz vocabulary
Note that this script is for educational purposes only. When using the scraped content, follow the copyright laws and regulations applicable in your country. Make sure to properly cite the content's source and respect the content owners' intellectual property rights.