Spread

Spread is a missing link in the unix toolchain to do simple distributed data mining. Spread partitions data according to key across a fleet. Spread shines when used in conjunction with fleet management/coordination/persistence tools like:

Quickstart

You'll need CMake and Boost to build spread:

sudo apt-get install -y cmake libboost-all-dev

Then you can fetch and install spread:

git clone git@github.com:wavii/spread.git
cd spread
cmake . && make && sudo make install

Examples

Given a hostfile on each machine with the format host1\nhost2\nhost3\n..., here is a distributed wordcount using spread:

tr ' ' '\n' < input | spread -f hosts | sort | uniq -c > output

Spread can also be used as a library for writing your own map programs in C++:

void map(tcp::endpoint& local_endpoint, vector<tcp::endpoint>& remote_endpoints)
{
   asio::io_service io;
   spreader<> s(io, local_endpoint, remote_endpoints, cout);
   string line;
   while (getline(cin, line))
      s << line.substr(line.find("USER: ")); // map the user in a log line
}

Spread works well with cloud concepts. Here's an example of a bash script you could run on each host to operate on S3 data in parallel:

#!/bin/bash

# grab fleet index and size, perhaps this was provided on each machine by Chef
fleetsize=$(wc -l .fleet-hosts | awk '{print $1}')
fleetsize=$(echo "obase=16; $fleetsize" | bc)
id=$(cat .fleet-id)

for uri in `s3cmd ls s3://data-warehouse/ | awk '{print $4}'`; do
   md5id=$(echo $uri | md5sum | awk '{print toupper($1)}')
   remainder=$(echo "ibase=16; $md5id % $fleetsize" | bc)
   if [ $remainder -eq $id ]; then
      s3cmd get --no-progress --force $uri /mnt/input/
   fi
done

# now spread the input we've collected on each machine
cat /mnt/input/* | tr ' ' '\n' | spread -f hosts | sort | uniq -c > output

License

Spread via the MIT License. Have fun!

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
include/spread		include/spread
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spread

Quickstart

Examples

License

About

Releases

Packages

Contributors 2

Languages

License

wavii/spread

Folders and files

Latest commit

History

Repository files navigation

Spread

Quickstart

Examples

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages