
CS242 Project

Project Structure

How to get db

from twit.utils import GetMongo_client

The GetMongo_client class, defined in cs_242/src/app/twit/utils.py, provides the mongodb client variable.
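
A minimal usage sketch follows; the get_db accessor and the wrapped pymongo client are assumptions here, so check twit/utils.py for the actual interface:

# Hypothetical sketch: GetMongo_client's exact interface is an assumption.
from twit.utils import GetMongo_client

client = GetMongo_client()           # assumed to wrap a pymongo MongoClient
db = client.get_db()                 # assumed accessor; see twit/utils.py
tweets = db['twit_tweet']            # collection named in the MongoDB Express section
print(tweets.count_documents({}))    # e.g. count all captured tweets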

Usage

  1. From the root of the project directory [/cs_242/], run docker-compose up. This will start all the required containers.

  2. Before using the search engine, load data into the db and create the Lucene and Hadoop indexes. The steps to do so are below.

  3. After everything is up and running, open the app's main page at http://localhost:1337/twit/.

Capture the stream

  1. After compose up, enter the django_twitter container

docker exec -it django_twitter bash

  2. Run management cmds
cd src/app

python manage.py run-tweepy 1 # collects 1Mb of data
# OR
python manage.py run-tweepy 1 -p 2 # collects 1Gb of data with 2 parallel processes, limited to 1 per account
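
For reference, run-tweepy presumably wraps a Tweepy stream listener. A minimal sketch of that pattern, assuming the Tweepy 3.x API and placeholder credentials (the real command's internals may differ):

# Hypothetical sketch of the pattern run-tweepy likely wraps (Tweepy 3.x API).
import tweepy

CONSUMER_KEY = '...'        # placeholder credentials; the project keeps its
CONSUMER_SECRET = '...'     # real keys elsewhere (one set per account)
ACCESS_TOKEN = '...'
ACCESS_TOKEN_SECRET = '...'

class TweetWriter(tweepy.StreamListener):
    def on_data(self, raw_data):
        # run-tweepy would persist this into MongoDB instead of printing,
        # stopping once the requested volume (e.g. 1Mb) is reached.
        print(raw_data)
        return True

    def on_error(self, status_code):
        return False  # disconnect the stream on errors such as 420 rate limits

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
tweepy.Stream(auth=auth, listener=TweetWriter()).sample(languages=['en'])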

Load data into DB

  1. Download the tweets.json file from Google Drive (tweets.json). Only an RMail account can access the link.

  2. Place the JSON file in the resources/storage/ directory of the app.

  3. After compose up, enter the django_twitter container

docker exec -it django_twitter bash

  4. Run management cmds
cd src/app

python manage.py load-csv -fp 'downloaded_file_name'

# OR load sample

python manage.py load-csv -fp 'twit_tweet-standard.json'

# files MUST exist in the resources/storage/ directory
# the resources/storage/ path is implied
# this file's full path would be resources/storage/twit_tweet-standard.json
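
Internally, load-csv presumably reads the file out of resources/storage/ and bulk-inserts its records into the twit_tweet collection. A rough sketch of that idea, treating the line-delimited JSON layout and the get_db accessor as assumptions:

# Hypothetical sketch of what load-csv likely does; the line-delimited
# JSON layout and the accessor names are assumptions.
import json
from twit.utils import GetMongo_client

def load_tweets(file_name):
    path = 'resources/storage/' + file_name   # the storage path is implied
    db = GetMongo_client().get_db()           # assumed accessor, as above
    with open(path) as fh:
        docs = [json.loads(line) for line in fh if line.strip()]
    db['twit_tweet'].insert_many(docs)        # bulk insert into the tweets collection

load_tweets('twit_tweet-standard.json')       # the bundled sample file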

create index [Lucene]

  1. After compose up, enter the django_twitter container

docker exec -it django_twitter bash

  2. Run management cmds
cd src/app
python manage.py index-tweets 

# creates two indexes ['tweet_index', 'tag_index']
# indexes are located in resources/storage/index_name
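
index-tweets presumably builds these with PyLucene; the field names, analyzer choice, and data source in this sketch are assumptions, not the command's actual implementation:

# Hypothetical PyLucene sketch of building tweet_index; fields are assumptions.
import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from twit.utils import GetMongo_client

lucene.initVM()
directory = SimpleFSDirectory(Paths.get('resources/storage/tweet_index'))
writer = IndexWriter(directory, IndexWriterConfig(StandardAnalyzer()))

tweets = GetMongo_client().get_db()['twit_tweet'].find()  # assumed accessor, as above
for tweet in tweets:
    doc = Document()
    doc.add(TextField('text', tweet['text'], Field.Store.YES))  # assumed field name
    writer.addDocument(doc)

writer.close()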

create index [Hadoop]

  1. After compose up, enter the namenode container

docker exec -it namenode bash

  2. Run management cmds
cd home/hadoopMR
sh exec.sh  

# exec.sh works as follows: it exports the data from the 'twit_tweet' collection as CSV,
# runs the Hadoop Map-Reduce jobs over that data (which internally call
# the ranking function), and stores the final indexed-ranked documents in the DB.
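
The Map-Reduce jobs themselves live under hadoopMR; purely as an illustration of the inverted-index idea they implement, here is a Hadoop-Streaming-style sketch in Python (the real jobs and their ranking function may differ substantially):

# Illustrative Hadoop-Streaming-style inverted index; the CSV layout
# ("tweet_id,text" per line) is an assumption.
import sys
from itertools import groupby

def mapper(lines):
    # emit one (term, tweet_id) pair per token
    for line in lines:
        tweet_id, _, text = line.strip().partition(',')
        for term in text.lower().split():
            print('%s\t%s' % (term, tweet_id))

def reducer(lines):
    # input arrives sorted by term; emit term -> posting list of tweet ids
    pairs = (line.strip().split('\t', 1) for line in lines)
    for term, group in groupby(pairs, key=lambda kv: kv[0]):
        postings = sorted({tweet_id for _, tweet_id in group})
        print('%s\t%s' % (term, ','.join(postings)))

if __name__ == '__main__':
    # run as: -mapper 'python mr.py map' -reducer 'python mr.py reduce'
    mapper(sys.stdin) if sys.argv[1] == 'map' else reducer(sys.stdin)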

DB GUI | MongoDB Express [MongoDB GUI]

  1. After compose up, go to http://localhost:8081/db/django/:
    • twit_tweet collection contains all tweet data from Twitter
    • ranked_index collection contains the Hadoop inverted index

Changelog

  • first commit: skeleton code for the CS 242 project, made with Django, MongoDB, and Docker
  • Jan 31: Complete Twitter crawling