h1. HBacker

A Ruby tool to export / import HBase to S3, HDFS or the local file system.

*THIS HAS BARELY BEEN TESTED* It seems to work, but it is still being tested and the incremental export / import has not been exercised yet.

*NOTE:* The incremental export/import will *not* handle deletes to HBase. There is no way to know that a delete happened. Our use case does not have deletes; we never delete rows, so it is fine for that.

Please feel free to fork, but please do send pull requests or file issue reports if you find problems.

h2. Dependencies

* hbase 0.20.3 - hbase 0.90.x
* hadoop 0.20.2
* Stargate installed on the hbase master
* beanstalkd
* beanstalkd ruby gem
* stalker ruby gem
* AWS SimpleDB
* right_aws ruby gem
* hbase-stargate ruby gem
* yard ruby gem
* bundler ruby gem

h2. Setting up Prerequisites

h3. Stargate

From the hbase-stargate docs:

# Download and unpack the most recent release of HBase from http://hadoop.apache.org/hbase/releases.html#Download
# Edit ${HBASE_HOME}/conf/hbase-env.sh and uncomment/modify the following line to correspond to your Java home path: export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Copy ${HBASE_HOME}/contrib/stargate/hbase-VERSION-stargate.jar into ${HBASE_HOME}/lib
# Copy all the files in the ${HBASE_HOME}/contrib/stargate/lib folder into ${HBASE_HOME}/lib
# Start up HBase: $ ${HBASE_HOME}/bin/start-hbase.sh
# Start up Stargate (append "-p 1234" at the end if you want to change the port from the 8080 default): $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.stargate.Main

h3. beanstalkd

h4. If you are lucky (Ubuntu 10 onward)

bc. apt-get install beanstalkd

h4. From source (only tested on Ubuntu 9.10)

bc. sudo apt-get update
sudo apt-get install libevent-1.4-2 libevent-dev
wget --no-check-certificate https://github.com/downloads/kr/beanstalkd/beanstalkd-1.4.6.tar.gz
tar xzf beanstalkd-1.4.6.tar.gz
cd beanstalkd-1.4.6
./configure
make

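Once the build finishes you can either run the freshly built binary straight out of the build directory or install it system-wide. A minimal sketch, assuming the default layout of the 1.4.6 tarball:

bc. ./beanstalkd -d    # run the just-built daemon in daemon mode, straight from the build directory
sudo make install      # optional: install it so that plain beanstalkd is on your PATH
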
h3. Configure HDFS to have S3 credentials

Assuming you will be using S3 to store the backups, you need to give Hadoop HDFS proper access to S3. Update ${HADOOP_HOME}/conf/hdfs-site.xml with something like:

  
bc. <property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YourAwsAccessKeyId</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YourAwsSecretAccessKey</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YourAwsAccessKeyId</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YourAwsSecretAccessKey</value>
</property>

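Before kicking off a backup it is worth confirming that Hadoop can actually reach your bucket with those keys. A quick, hedged check; example-hbase-backups stands in for whatever bucket you plan to export into:

bc. ${HADOOP_HOME}/bin/hadoop fs -ls s3n://example-hbase-backups/   # should list the (possibly empty) bucket if the credentials are picked up
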
You can use AWS IAM to create a user and policy that only allows Hadoop to access specific buckets and so on; that's a tutorial all on its own.

h2. Configure AWS credentials for Hbacker to access AWS SimpleDB

You will also need to create a YAML file with suitable AWS credentials for accessing SimpleDB. By default Hbacker looks for this file at ~/.aws/aws.config. The format of the file is:

bc. access_key_id: YourAwsAccessKeyId
secret_access_key: YourAwsSecretAccessKey

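Since this file holds live AWS credentials, it is worth making sure only your user can read it. An optional precaution, not something hbacker itself requires:

bc. mkdir -p ~/.aws               # in case the directory does not exist yet
chmod 600 ~/.aws/aws.config       # restrict the credentials file to your user
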
They don't have to be the same keys as Hadoop HDFS uses.

h2. Starting things

h3. Have beanstalkd running

@beanstalkd -d@ will run beanstalkd in daemon mode on port 11300.

A nice tool to monitor what's going on with your beanstalkd is https://github.com/dustin/beanstalk-tools Assuming you are within the beanstalk-tools directory, you can execute:

bc. ./bin/beanstalk-queue-stats.rb localhost:11300

h3. Start the workers

*YOU MUST START SOME WORKERS BEFORE RUNNING HBACKER*

A worker job operates on a single HBase table. Normally it records the start, the status and the table schema to SimpleDB and then runs the Hadoop job to export or import the table. It waits around to handle errors and to record the completion.

The number of workers you spawn should be roughly the number of tables your Hadoop cluster can operate on in parallel. A big table (one with a lot of rows and/or non-nil columns) will use multiple Map / Reduce jobs at once, so it's a bit hard to figure it all out in advance. The Hbacker master process will stop making more job requests if the Hadoop cluster gets too busy or if the Hbacker worker queue starts to back up, but it's not very scientific yet. There are a number of hbacker command line parameters with which you can override some of this behavior.

Right now the easiest way to start the workers is something like the following, run on the machine that will be making the Hadoop requests (the hbase master works fine for this). Run it as a user that has the proper permissions to run the Hadoop HBase jobs. This will run 4 workers:

bc. for i in {1..4}; do bundle exec bin/hbacker_worker 1> LOGS/worker_$i.log 2>&1 & done

h3. Start the main process

It is best to run this as the user who has Hadoop access to run the Hadoop Map/Reduce jobs.

@hbacker help@ will tell you the commands available. @hbacker help export@ will tell you the options for the export (backup) command.

Sample export command: this will export all tables from the cluster whose master is hbase-master.example.com to the S3 bucket example-hbase-backups. It will create a "sub-directory" in the bucket whose name is a timestamp based on Time.now.utc, and then put a "sub-directory" per HBase table under that timestamped directory.

bc. hbacker export -H hbase-master.example.com -D s3n://example-hbase-backups/ --mapred-max-jobs=18 -a &> main.log &

h2. Issues and Caveats

* Export / Import to/from HDFS should work, but is untested.
* When doing Export / Import to/from HDFS, the command history is currently stored in a directory under the current directory where the hbacker program is run. Eventually it will be stored in HDFS as well.

h2. Example Usage

You only need the @bundle exec@ if you are running this within the hbacker source directory. If it's installed as a gem, you would not prefix the command with @bundle exec bin/@.

h3. Startup 4 workers

bc. for i in {1..4}; do bundle exec bin/hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done

or

bc. for i in {1..4}; do hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done

h3. Running export to a local filesystem

bc. bundle exec bin/hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
  -H localhost -D file:///tmp/dump/ -t TABLE_NAME

or

bc. hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -H localhost \
  -D file:///tmp/dump/ -t TABLE_NAME

Another example:

bc. hbacker export --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost -D file:///home/rberger/work/dump/ \
  -t furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090

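If a local-filesystem export succeeds, the destination passed with -D should end up containing a timestamp-named session directory with one sub-directory per exported table. A quick way to eyeball the result, assuming the /tmp/dump/ destination used in the examples above:

bc. ls -lR /tmp/dump/    # shows the session directory and the per-table sub-directories beneath it
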
h3. Running import from a local filesystem

bc. bundle exec bin/hbacker import --session-name=SESSION_TIMESTAMP --source-root=file:///tmp/dump/ \
  --import-hbase-host=localhost -H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
  -t TABLE_NAME

or

bc. hbacker import --session-name=SESSION_TIMESTAMP --source-root=file:///tmp/dump/ --import-hbase-host=localhost \
  -H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -t TABLE_NAME

or

bc. hbacker import --session-name=20110802_060004 --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost \
  -I localhost --source-root=file:///home/rberger/work/dump/ \
  -t rob_test_furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090

h2. Todos and other notes for the future

* Have an easy way to continue a backup that was aborted, with minimal redundant copying on recovery.
* Have an easy way to do an incremental export based on the last export (right now you have to manually figure out the parameters for the next increment).
* Make sure Export/Import to/from HDFS work and that the command logs are stored there too.
* Replace/Update the standard HBase export/import module to get some metrics out:
** Number of rows exported/imported
** Some kind of hash or other mechanism to use for data integrity checking
* Propose a mechanism to be added to HBase that robustly logs deletes. A delete log could be used by backup schemes like this one to allow restores of tables that have had rows deleted.

h2. Copyright

Copyright (c) 2011 Robert J. Berger and Runa, Inc.