from twit.utils import GetMongo_client
The GetMongo_client class is defined in cs_242/src/app/twit/utils.py; use it to get the MongoDB variable.
- From the root level of the project dir [/cs_242/], run
docker-compose up
This will start all the required containers.
- Before using the search engine, load data into the db and create the Lucene and Hadoop indexes. The steps to do so are below.
- After everything is up and running, open the app's main page at http://localhost:1337/twit/.
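The containers can take a while to start, so it may help to poll the main page before opening it. The following is a minimal sketch using only the Python standard library; `http_ok` and `wait_for_app` are hypothetical helper names that are not part of the project, and the attempt count and delay are arbitrary assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request

APP_URL = "http://localhost:1337/twit/"  # main page URL given in the README


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_app(url=APP_URL, attempts=30, delay=2.0, probe=http_ok):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # hypothetical pause between probes
    return False

# Usage (run on the host after `docker-compose up`):
#   wait_for_app()  returns True once the page responds, False otherwise
```

The `probe` parameter exists only so the polling loop can be exercised without a running server.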
- After compose up, enter the django_twitter container
docker exec -it django_twitter bash
- Run management cmds
cd src/app
python manage.py run-tweepy 1 # this collects 1Mb of data
# OR
python manage.py run-tweepy 1 -p 2 # this collects 1Gb of data with 2 parallel processes (limit: 1 per account)
- Download the tweets.json file from Google Drive. Only RMail accounts can access the link.
- Place the JSON file in the resources/storage/ directory of the app.
- After compose up, enter the django_twitter container
docker exec -it django_twitter bash
- Run management cmds
cd src/app
python manage.py load-csv -fp 'downloaded_file_name'
# OR load sample
python manage.py load-csv -fp 'twit_tweet-standard.json'
# files MUST exist in the resources/storage/ directory
# the resources/storage/ path is implied, so pass only the file name
# e.g. this file's full path would be resources/storage/twit_tweet-standard.json
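The implied-prefix rule in the comments above can be sketched as follows. This only illustrates the naming convention; `resolve_storage_path` is a hypothetical helper, and the actual load-csv command may resolve the -fp value differently.

```python
from pathlib import Path

STORAGE_DIR = Path("resources/storage")  # directory named in the README


def resolve_storage_path(file_name, storage_dir=STORAGE_DIR):
    """Map the value passed to -fp onto a path inside the storage directory."""
    name = Path(file_name)
    # Accept bare file names only, since the directory prefix is implied.
    if name.is_absolute() or len(name.parts) != 1 or name.name in ("", ".", ".."):
        raise ValueError(f"expected a bare file name, got {file_name!r}")
    return storage_dir / name


print(resolve_storage_path("twit_tweet-standard.json").as_posix())
```

Rejecting values that already contain a directory is a design choice of this sketch, made so that a doubled prefix such as resources/storage/resources/storage/... cannot be built by accident.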
- After compose up, enter the django_twitter container
docker exec -it django_twitter bash
- Run management cmds
cd src/app
python manage.py index-tweets
# creates two indexes ['tweet_index', 'tag_index']
# indexes are located in resources/storage/index_name
- After compose up, enter the namenode container
docker exec -it namenode bash
- Run management cmds
cd home/hadoopMR
sh exec.sh
# exec.sh exports the data from the 'twit_tweet' collection as CSV
# and runs the Hadoop Map-Reduce jobs on that data; the jobs internally call
# the ranking function and store the final indexed, ranked documents in the DB.
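To give a sense of what a Map-Reduce inverted index looks like, here is a toy, in-memory sketch in plain Python. It is not the project's Hadoop job: the function names are hypothetical, tokenization is a simple whitespace split, and ranking by term frequency is an assumed stand-in, since the README does not say which ranking function exec.sh uses.

```python
from collections import defaultdict


def map_tweet(tweet_id, text):
    """Map step: emit one (term, tweet_id) pair per token occurrence."""
    for token in text.lower().split():
        yield token, tweet_id


def reduce_postings(pairs):
    """Reduce step: group pairs by term and count occurrences per tweet."""
    index = defaultdict(lambda: defaultdict(int))
    for term, tweet_id in pairs:
        index[term][tweet_id] += 1
    # Order each posting list by term frequency, highest first (ties by id).
    return {
        term: sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        for term, counts in index.items()
    }


def build_index(tweets):
    """Run the map step over every tweet, then reduce all emitted pairs."""
    pairs = (pair for tweet_id, text in tweets for pair in map_tweet(tweet_id, text))
    return reduce_postings(pairs)


sample = [(1, "hello world hello"), (2, "hello there")]
index = build_index(sample)
print(index["hello"])  # [(1, 2), (2, 1)]
```

Each entry maps a term to a list of (tweet_id, count) pairs, which is the general shape of an inverted index; the documents stored by the real job may carry different fields.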
- After compose up, go to http://localhost:8081/db/django/
- The twit_tweet collection contains all tweet data from Twitter
- The ranked_index collection contains the Hadoop inverted index
- first commit * skeleton code for cs 242 project. made with django, mongodb, docker
- Jan 31 * Complete Twitter crawling