h1. HBacker
Ruby Tool to Export / Import HBase to S3, HDFS or local file system.
*THIS HAS BARELY BEEN TESTED*
It seems to work, but it is still being tested. The incremental export / import has not been exercised yet.
*NOTE:* The incremental export/import will *not* handle deletes to HBase. There is no way to know that there was a delete. Our use case does not have deletes. We never delete rows, so it's fine for that.
Please feel free to fork, but please do send pull requests or file issue reports if you find problems.
h2. Dependencies
* hbase 0.20.3 - hbase 0.90.x
* hadoop 0.20.2
* Stargate installed on hbase master
* beanstalkd
* beanstalkd ruby gem
* stalker ruby gem
* AWS SimpleDB
* right_aws ruby gem
* hbase-stargate ruby gem
* yard ruby gem
* bundler ruby gem
h2. Setting up Prerequisites
h3. Stargate
From the hbase_stargate docs:
# Download and unpack the most recent release of HBase from http://hadoop.apache.org/hbase/releases.html#Download
# Edit /conf/hbase-env.sh and uncomment/modify the following line to correspond to your Java home path: export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Copy /contrib/stargate/hbase--stargate.jar into /lib
# Copy all the files in the /contrib/stargate/lib folder into /lib
# Start up HBase: $ ${HBASE_HOME}/bin/start-hbase.sh
# Start up Stargate (append "-p 1234" at the end if you want to change the port from the 8080 default): $ /bin/hbase org.apache.hadoop.hbase.stargate.Main
h3. beanstalkd
h4. If you are lucky (Ubuntu 10 onward)
apt-get install beanstalkd
h4. From source (only tested on Ubuntu 9.10)
sudo apt-get update
sudo apt-get install libevent-1.4-2 libevent-dev
wget --no-check-certificate https://github.com/downloads/kr/beanstalkd/beanstalkd-1.4.6.tar.gz
tar xzf beanstalkd-1.4.6.tar.gz
cd beanstalkd-1.4.6
./configure
make
h3. Configure HDFS to have S3 credentials
Assuming you will be using S3 for storing the backups, you'll need to give hadoop hdfs proper access to S3. You will need to update ${HADOOP_HOME}/conf/hdfs-site.xml with something like:
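<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YourAwsAccessKeyId</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YourAwsSecretAccessKey</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YourAwsAccessKeyId</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YourAwsSecretAccessKey</value>
</property>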
You can use AWS IAM to create a user and policy to just allow Hadoop to access specific buckets and so on. That's a tutorial all on its own.
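As a quick sanity check after editing the file, a small shell sketch along these lines can confirm that all four property names are present. The function name check_s3_props is made up for illustration and is not part of Hadoop or Hbacker. It only looks for the property names; it does not validate the XML or the credentials themselves.

```shell
# Hypothetical helper (not part of Hadoop or Hbacker): report any of the
# four S3 credential property names that are missing from a config file.
check_s3_props() {
  conf_file="$1"
  missing=0
  for prop in fs.s3.awsAccessKeyId fs.s3.awsSecretAccessKey \
              fs.s3n.awsAccessKeyId fs.s3n.awsSecretAccessKey; do
    # -F: match a fixed string, -q: no output, only an exit status
    if ! grep -Fq "<name>$prop</name>" "$conf_file"; then
      echo "missing property: $prop" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (path taken from the text above):
#   check_s3_props "${HADOOP_HOME}/conf/hdfs-site.xml" && echo "S3 keys present"
```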
h2. Configure AWS credentials for Hbacker to access AWS SimpleDB
You will also need to create a YAML file with suitable AWS credentials to access SimpleDB. By default Hbacker will look for the file in ~/.aws/aws.config. The format of that file is:
access_key_id: YourAwsAccessKeyId
secret_access_key: YourAwsSecretAccessKey
They don't have to be the same keys as Hadoop HDFS uses.
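One way to create that file from the shell is sketched below: it writes the two keys with a here-document and restricts the file's permissions so that other users cannot read the secret key. The function name write_aws_config is an invention of this sketch, not an Hbacker command, and the placeholder values must be replaced with real keys.

```shell
# Hypothetical helper (not part of Hbacker): create the credentials file
# with owner-only permissions, without overwriting an existing file.
write_aws_config() {
  config_file="$1"
  if [ -e "$config_file" ]; then
    echo "$config_file already exists, leaving it alone" >&2
    return 0
  fi
  mkdir -p "$(dirname "$config_file")" || return 1
  (
    # umask 077 inside a subshell: the new file is readable only by its owner
    umask 077
    cat > "$config_file" <<'EOF'
access_key_id: YourAwsAccessKeyId
secret_access_key: YourAwsSecretAccessKey
EOF
  )
}

# Usage (default location named in the text above):
#   write_aws_config "$HOME/.aws/aws.config"
```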
h2. Starting things
h3. Have beanstalkd running
beanstalkd -d
This runs beanstalkd in daemon mode on port 11300 (its default).
A nice tool to monitor what's going on with your beanstalkd is https://github.com/dustin/beanstalk-tools
Assuming you are within the directory of beanstalk-tools you can execute:
./bin/beanstalk-queue-stats.rb localhost:11300
h3. Start the workers
*YOU MUST START SOME WORKERS BEFORE RUNNING HBACKER*
A worker job is based on a single HBase Table. Normally it will record the start, status and table schema to SimpleDB and then run the Hadoop job to export or import the table. It waits around to handle errors and record the completion.
The number of workers you spawn should be roughly the number of tables your Hadoop cluster can process in parallel. A big table (one with a lot of rows and/or non-nil columns) will use multiple Map / Reduce jobs at once, so it's a bit hard to figure it all out in advance.
The Hbacker master process will stop making more job requests if the Hadoop cluster gets too busy or if the Hbacker worker queue starts to back up. But it's not very scientific yet... There are a number of hbacker command line parameters with which you can override some of this behavior.
Right now the easiest way to start the workers is something like the following, run on the machine that will be making the Hadoop requests (the HBase master works fine for this). Run it as a user that has the proper permissions to run the Hadoop HBase jobs.
This will run 4 workers, each logging to its own file (the LOGS directory must already exist):
for i in {1..4}; do bundle exec bin/hbacker_worker 1> LOGS/worker_$i.log 2>&1 & done
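If you want the worker count and log directory to be adjustable, the loop can be wrapped in a small function. This is only a sketch: start_workers is a made-up name, not something Hbacker ships, and it simply backgrounds whatever command you pass it, creating the log directory first so the redirection cannot fail.

```shell
# Hypothetical helper (not part of Hbacker): start COUNT background copies
# of a command, each writing stdout and stderr to its own log file.
start_workers() {
  count="$1"
  log_dir="$2"
  shift 2                      # the remaining arguments are the command to run
  mkdir -p "$log_dir" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    "$@" > "$log_dir/worker_$i.log" 2>&1 &
    i=$((i + 1))
  done
}

# Usage, equivalent to the loop above:
#   start_workers 4 LOGS bundle exec bin/hbacker_worker
```

Passing the command as arguments keeps the function independent of whether hbacker is run from the source directory or installed as a gem.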
h3. Start the main process
Best to run this as the user who has hadoop access to run the hadoop map reduce jobs.
Running hbacker help lists the commands available.
Running hbacker help export lists the options for the export (backup) command.
Sample export command:
This will export all tables from the cluster whose master is hbase-master.example.com to the S3 bucket example-hbase-backups. It will create a "sub-directory" in the bucket whose name is a timestamp based roughly on "Time.now.utc", and then put a "sub-directory" per HBase table under that timestamp directory:
hbacker export -H hbase-master.example.com -D s3n://example-hbase-backups/ --mapred-max-jobs=18 -a &> main.log &
h2. Issues and Caveats
* Export / Import to/from HDFS should work, but is untested.
* When doing Export / Import to/from HDFS, the command history is, for now, stored in a directory under the working directory from which the hbacker program is run. Eventually it will be stored in HDFS as well.
h2. Example Usage
You only need the bundle exec prefix if you are running this within the hbacker source directory. If it's installed as a gem, do not prefix the command with bundle exec bin/
h3. Start up 4 workers
for i in {1..4}; do bundle exec bin/hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done
or
for i in {1..4}; do hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done
h3. Running export to a local filesystem
bundle exec bin/hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
-H localhost -D file:///tmp/dump/ -t
or
hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -H localhost \
-D file:///tmp/dump/ -t
Another example:
hbacker export --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost -D file:///home/rberger/work/dump/ \
-t furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090
h3. Running import from a local filesystem
bundle exec bin/hbacker import --session-name= --source-root=file:///tmp/dump/ \
--import-hbase-host=localhost -H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
-t
or
hbacker import --session-name= --source-root=file:///tmp/dump/ --import-hbase-host=localhost \
-H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -t
or
hbacker import --session-name=20110802_060004 --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost \
-I localhost --source-root=file:///home/rberger/work/dump/ \
-t rob_test_furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090
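When importing from a local dump, it can help to look up the most recent session rather than typing it by hand. The sketch below assumes that each export leaves a directory under the dump root named like the session in the example above (20110802_060004) and that this directory name is what --session-name expects; both are inferences from the examples here, not documented behavior. latest_session is a made-up helper name.

```shell
# Hypothetical helper (not part of Hbacker): print the newest directory
# under a local dump root whose name looks like YYYYMMDD_HHMMSS.
latest_session() {
  dump_root="$1"
  for dir in "$dump_root"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9]; do
    # If nothing matches, the pattern stays literal and this test skips it
    [ -d "$dir" ] && basename "$dir"
  done | sort | tail -n 1
}

# Usage:
#   hbacker import --session-name="$(latest_session /tmp/dump)" \
#     --source-root=file:///tmp/dump/ ...
```

Because the names are zero-padded timestamps, a plain lexical sort puts the newest session last.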
h2. Todos and other notes for the future
* Have an easy way to continue a backup that was aborted with minimal redundant copying on recovery.
* Have an easy way to request an incremental export based on the last export (right now you have to manually figure out the parameters for the next increment).
* Make sure Export/Import to/from HDFS work and the command logs are stored there too.
* Replace/Update the standard HBase export/import module to get some metrics out:
** Number of rows exported/imported
** Some kind of hash or other mechanism to use for data integrity checking
* Propose a mechanism to be added to HBase to robustly log deletes. That could be used with backup schemes like this to use a delete log to allow for restores of tables that have had deleted rows.
h2. Copyright
Copyright (c) 2011 Robert J. Berger and Runa, Inc.