This project consists of two modules: webcrawler, for crawling data from news websites, and noah, for crawling data from Twitter.
Prerequisites: Scala, sbt, AsterixDB, TextGeoLocator
Follow the official AsterixDB documentation to set up a fully functional cluster.
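Before building, it can help to confirm that the command-line prerequisites are installed. The following is a minimal sketch, assuming a POSIX shell; the helper name `check_tool` is hypothetical and not part of this project. It only checks that `scala` and `sbt` are on the PATH and does not verify AsterixDB or TextGeoLocator, which run as services.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# scala and sbt are the command-line prerequisites listed above.
for tool in scala sbt; do
  # Keep going after a missing tool so every prerequisite is reported.
  check_tool "$tool" || true
done
```

Any line reporting "missing" indicates a tool that should be installed before running the sbt commands in this README.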
Webcrawler is an application that integrates with the Webhose.io API to crawl data from news websites, geotag it, and ingest it into AsterixDB.
Parameters description:
- -tk or --api-key: Webhose.io API key
- -kw or --keywords: Keywords to search for in the news
- -co or --country-code: Thread country code
- -ds or --days-ago: Crawl since the given number of days ago, default 1
- -tglurl or --textgeolocatorurl: URL of the TextGeoLocator API, default "http://localhost:9000/location"
- -u or --url: URL of the feed adapter
- -p or --port: Port of the feed socket
- -w or --wait: Waiting milliseconds per record, default 500
- -b or --batch: Batch size per waiting period, default 50
- -c or --count: Maximum number to feed, default unlimited
- -fo or --file-only: Only store in a file, do not geotag or ingest, default false
You can run the following example command in a separate command line window:
> cd crawler
> sbt "project webcrawler" "run-main Crawler \
> -tk "Your Webhose.io token" \
> -kw "dengue", "zika", "zikavirus", "microcefalia", "febreamarela", "chikungunya" \
> -co "BR" \
> -ds 1 \
> -tglurl "http://localhost:9000/location" \
> -u 127.0.0.1 \
> -p 10010 \
> -w 0 \
> -b 50"
Noah is a module that continuously crawls new tweets that mention the specified keywords, geotags them, and ingests them into AsterixDB.
Parameters description:
- -ck or --consumer-key: Consumer key for Twitter API OAuth
- -cs or --consumer-secret: Consumer secret for Twitter API OAuth
- -tk or --token: Token for Twitter API OAuth
- -ts or --token-secret: Token secret for Twitter API OAuth
- -tr or --tracker: Tracked terms
- -u or --url: URL of the feed adapter
- -p or --port: Port of the feed socket
- -w or --wait: Waiting milliseconds per record, default 500
- -b or --batch: Batch size per waiting period, default 50
- -c or --count: Maximum number to feed, default unlimited
- -fo or --file-only: Only store in a file, do not geotag or ingest, default false
You can run the following example command in a separate command line window:
> cd crawler
> sbt "project noah" "run-main edu.uci.ics.cloudberry.noah.feed.TwitterFeedStreamDriver \
> -ck Your consumer key \
> -cs Your consumer secret \
> -tk Your token \
> -ts Your token secret \
> -tr dengue zikavirus microcefalia febreamarela chikungunya \
> -u 127.0.0.1 \
> -p 10001 \
> -w 0 \
> -b 50"
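To avoid typing OAuth credentials directly on the command line, the Noah arguments can be assembled from environment variables. The following is a minimal sketch, assuming a POSIX shell. The function name `noah_command` and all variable names (`TWITTER_CONSUMER_KEY`, `FEED_HOST`, and so on) are hypothetical; only the flags, the driver class, and the fallback values are taken from the example above. It assumes the credentials contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the sbt "run-main" command string for Noah.
# Credentials come from environment variables (names are assumptions);
# the other values fall back to those used in the example above.
noah_command() {
  printf 'run-main edu.uci.ics.cloudberry.noah.feed.TwitterFeedStreamDriver'
  printf ' -ck %s -cs %s' "$TWITTER_CONSUMER_KEY" "$TWITTER_CONSUMER_SECRET"
  printf ' -tk %s -ts %s' "$TWITTER_TOKEN" "$TWITTER_TOKEN_SECRET"
  printf ' -tr %s' "${NOAH_TRACK:-dengue zikavirus microcefalia febreamarela chikungunya}"
  printf ' -u %s -p %s' "${FEED_HOST:-127.0.0.1}" "${FEED_PORT:-10001}"
  printf ' -w %s -b %s\n' "${FEED_WAIT_MS:-0}" "${FEED_BATCH:-50}"
}

# Usage, after exporting the four credential variables:
#   cd crawler
#   sbt "project noah" "$(noah_command)"
```

Printing the command string first makes it easy to inspect the arguments before passing them to sbt.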
- The Noah and Gnosis modules were adapted from TwitterMap.
- Currently, geotagging works only for Brazil.
- Users and developers are welcome to contact me through moniquemoledo@id.uff.br