A platform for real-time streaming search
The goal of this project is to provide a clean, scalable architecture for real-time search on streaming data. The project also contains utilities for simple throughput benchmarking of Elasticsearch Percolators vs. Lucene-Luwak. Preliminary benchmarks may be found at:
http://straw.ryanwalker.us/about
Comments and critiques about these benchmarks are greatly appreciated!
This project was inspired by the following excellent blog posts on streaming search:
- http://www.confluent.io/blog/real-time-full-text-search-with-luwak-and-samza/
- http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/
I completed this project as a Fellow in the 2015C Insight Data Engineering Silicon Valley program.
- Automated AWS cluster deployment utilities using boto3
- Java-based Storm implementation:
- KafkaSpout for query and document spouts
- Two flavors of streaming search bolts:
- Elasticsearch-Percolators
- Pure Lucene with Luwak
- Storm topology for streaming search and configuration management
- Scripts to populate document streams, including Twitter API sampling utilities
- Simple Python Flask web UI
- Testing and other utilities, including Docker components so that the entire topology can run on a local machine
The core of the platform is an Apache Storm cluster that parallelizes the work of real-time streaming search. Internally, the Storm cluster consumes messages from a Kafka cluster and distributes them to bolts, each of which holds a Lucene-Luwak index. The project also contains a demo Flask UI that manages subscriptions through a Redis PUBSUB system.
More about the architecture can be found at: http://straw.ryanwalker.us/about
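The subscription flow between the search workers and the demo UI can be illustrated with a minimal redis-py sketch. This is not the project's actual code: the channel name straw_matches, the host, and the message format are assumptions made for the example.

```python
# Minimal sketch of the Redis PUBSUB pattern the demo UI relies on.
# The channel name "straw_matches" and the message format are assumptions,
# not the frontend's actual scheme.
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# A producer (e.g., a search bolt) publishes a matched document to a channel:
r.publish("straw_matches", '{"query": "storm", "doc_id": "1234"}')

# The UI side subscribes to the channel and forwards matches to the browser:
pubsub = r.pubsub()
pubsub.subscribe("straw_matches")
for message in pubsub.listen():
    if message["type"] == "message":
        print("match received:", message["data"])
```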
- install docker-compose and redis-server
- run src/util/stage_demo_mode.sh. This creates Docker containers for Kafka (with Zookeeper) and Elasticsearch and populates these services with some example data. [BUG: You may have to run this script twice!]
- cd src/storming_search OR src/luwak_search
- run mvn package
- ./run_search_topology.sh [BUG: You need to push lots of documents to Kafka or you'll get an error; see the producer sketch after these steps]
- In a separate terminal, activate the frontend by calling ./run.py from src/frontend
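If the topology complains about missing documents, a small producer can keep the document stream populated. The sketch below uses kafka-python; the broker address and the topic name "documents" are assumptions, so match them to whatever the demo scripts configure.

```python
# Sketch: push sample documents into Kafka so the search topology has input.
# The broker address and topic name "documents" are assumptions; adjust them
# to match your demo configuration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for i in range(1000):
    producer.send("documents", {"id": i, "text": "sample document %d about storms" % i})

producer.flush()
```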
Prerequisites: You need
- The AWS CLI
- Python boto3
- Set your default configurations by calling "aws configure"
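As a quick sanity check that "aws configure" worked, boto3 should be able to resolve your identity. This snippet (using the standard STS API) is only a convenience, not part of the project's scripts.

```python
# Sanity check: confirm boto3 can see the credentials set by "aws configure".
import boto3

identity = boto3.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])
```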
Steps:
- cd src/aws_config
- run ./create_clusters.py --help for instructions about this AWS resource-creation script, then follow them. You may wish to edit a few values in this file, such as tag_prefix, which gives each service a unique custom prefix.
- src/aws_config/configure contains scripts to configure each of the individual services; you'll need to run each of these to configure the corresponding resource.
Once resources are created, you can run
./discover.py
NOTE: You should modify the tag prefix in this file to match the tag prefix you've set elsewhere (a sketch of tag-based discovery follows). [FEATURE: Tag should be set in a GLOBAL CONFIG file.]
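For orientation, tag-based discovery with boto3 looks roughly like the sketch below. This is not the contents of discover.py; the tag key "Name" and the prefix "straw*" are assumptions, so substitute the tag_prefix you configured in create_clusters.py.

```python
# Rough sketch of tag-based discovery of running cluster nodes.
# The tag key "Name" and the prefix "straw*" are assumptions; use your own tag_prefix.
import boto3

ec2 = boto3.resource("ec2")
instances = ec2.instances.filter(
    Filters=[
        {"Name": "tag:Name", "Values": ["straw*"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for inst in instances:
    print(inst.id, inst.private_ip_address, inst.public_dns_name)
```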
Install Redis on the same server as the UI and modify the bind interface in the Redis conf: set bind 0.0.0.0 (a connectivity check sketch follows at the end of this section).
sudo apt-get install redis-server
sudo vi /etc/redis/redis.conf
If you want to use a separate Redis instance for benchmarking, repeat the above steps on a different AWS machine. [FEATURE: Add the redis config to the deployment scripts.]
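After changing the bind setting, it's worth confirming that the Redis instance is reachable from the other host. A minimal check with redis-py, where the hostname is a placeholder for your Redis server's address:

```python
# Connectivity check from the UI (or benchmark) host to the remote Redis server.
# The hostname is a placeholder; substitute your Redis server's address.
import redis

r = redis.StrictRedis(host="your-redis-host.example.com", port=6379, socket_timeout=5)
print("PING ->", r.ping())  # True if the bind and security-group settings are correct
```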