This file documents the configuration we used to set up our small Hadoop (5 nodes) and Elasticsearch (3 nodes) clusters. All machines run Ubuntu 12.04.2 LTS server edition (‘precise’) and Java 7. We use Elasticsearch 1.1.0 and Hadoop 2.3.0, and Puppet 2.7.22 to set up the Hadoop nodes.
We used Puppet (see http://puppetlabs.com/puppet/what-is-puppet) to install Hadoop on the nodes. First, Puppet has to be installed on all machines. To get a current version of Puppet, add the Puppet repositories (see http://docs.puppetlabs.com/guides/puppetlabs_package_repositories.html#for-debian-and-ubuntu) as described for the master and the agents below.
On the master, add the Puppet repository and install the master and agent packages:
wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
sudo dpkg -i puppetlabs-release-precise.deb
sudo apt-get update
sudo apt-get install puppetmaster=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1 puppetmaster-common=2.7.22-1puppetlabs1 puppet=2.7.22-1puppetlabs1
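To check that the pinned version was actually installed:
puppet --version   # should print 2.7.22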
cp -R puppet /etc/puppet
cd /etc/puppet/modules
git clone https://github.com/fsteeg/puppet-java.git java
git clone https://github.com/fsteeg/puppet-hadoop.git hadoop
ssh-keygen ; cp -R ~/.ssh /etc/puppet/modules/hadoop/ssh
cd /etc/puppet/modules/java/files
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-x64.tar.gz"
cd /etc/puppet/modules/hadoop/files
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz
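Both archives are fetched over plain HTTP, so it is worth making sure they arrived intact before Puppet tries to unpack them; listing their contents is a quick sanity check:
tar -tzf /etc/puppet/modules/java/files/jdk-7u21-linux-x64.tar.gz > /dev/null && echo "JDK archive OK"
tar -tzf /etc/puppet/modules/hadoop/files/hadoop-2.3.0.tar.gz > /dev/null && echo "Hadoop archive OK"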
- remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file
- configure the puppet agents in puppet/manifests/nodes.pp
- configure the Java version to use in /etc/puppet/modules/java/manifests/params.pp
- configure the cluster’s Hadoop version, master, and slave nodes in /etc/puppet/modules/hadoop/manifests/params.pp (use the hostnames, not the aliases, i.e. the first value after the IP in /etc/hosts, not the second, as in the example line after this list; don’t include the master in the slave nodes)
- start the puppet master:
sudo puppet master --mkusers
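For illustration, an /etc/hosts entry in this setup looks like the following (the IP and the short alias are placeholders); the Puppet params files need the first name after the IP, not the alias:
192.168.1.11   weywot1.hbz-nrw.de   weywot1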
On the agents, likewise add the Puppet repository and install the agent packages:
wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
sudo dpkg -i puppetlabs-release-precise.deb
sudo apt-get update
sudo apt-get install puppet=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1
- remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file
- from each node (master and clients), request a certificate from the master (weywot1.hbz-nrw.de is the master):
puppet agent --test --server weywot1.hbz-nrw.de
- if this fails with certificate issues, run
puppet cert clean weywot1.hbz-nrw.de
on the master and
rm -rf /var/lib/puppet/ssl
on the clients
- on the master, sign the certificates:
sudo puppet cert -s weywot1.hbz-nrw.de weywot2.hbz-nrw.de weywot3.hbz-nrw.de weywot4.hbz-nrw.de weywot5.hbz-nrw.de
- from each node (master and clients), set everything up:
puppet agent --test --server weywot1.hbz-nrw.de
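If anything goes wrong with the certificate exchange, the state of all certificates can be inspected on the master (signed certificates are prefixed with a +):
sudo puppet cert --list --all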
- become the Hadoop user:
sudo su hduser
- the automated formatting of a new HDFS currently fails, see http://projects.puppetlabs.com/issues/5876, so we do it manually on the master and the clients:
bin/hadoop namenode -format
- start all daemons:
cd /opt/hadoop/hadoop; sh bin/start-all.sh
- visit the admin UIs at http://<master>:50070/ and http://<master>:50030/ to check if the cluster is up (for a command-line check, see below)
- copy the sample job files from the sample directory to /opt/hadoop/hadoop and run the job:
sh temperature-sample.sh
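The command-line check mentioned above: the HDFS report lists the DataNodes that have registered with the NameNode (run as hduser from /opt/hadoop/hadoop):
bin/hadoop dfsadmin -report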
After everything is set up, whenever the Puppet scripts are changed, run the following on all machines to update the setup:
puppet agent --test --server weywot1.hbz-nrw.de
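Since this has to happen on every node, a small loop can save some typing; a sketch, assuming the machine you run it from has SSH access to all nodes and the remote user can run the puppet agent as in the commands above:
for host in weywot1 weywot2 weywot3 weywot4 weywot5; do
  ssh $host.hbz-nrw.de 'puppet agent --test --server weywot1.hbz-nrw.de'
done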
For the Elasticsearch nodes, install Java 7 and the Elasticsearch package:
sudo apt-get install java7-jdk
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.deb
sudo dpkg -i elasticsearch-1.1.0.deb
- set cluster.name: quaoar in /etc/elasticsearch/elasticsearch.yml
- tweak memory settings in /etc/init.d/elasticsearch (we use ES_HEAP_SIZE=64g)
sudo service elasticsearch restart
curl http://localhost:9200
sudo /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head
curl http://localhost:9200/_plugin/head/
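Once Elasticsearch has been restarted on all three nodes, the cluster health API shows whether they have actually formed the quaoar cluster; number_of_nodes should be 3:
curl http://localhost:9200/_cluster/health?pretty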
Our current workflow for automatic updates:
- Transform the union catalogue to RDF in N-Triples serialization and trigger the update (lod_hdfs_update.sh; see the sketch after this list):
- Copy N-Triples from local file system to HDFS: hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
- SSH into the Hadoop cluster and start the updates script: ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh" (this requires a correct SSH key setup: the public key of the user logging in has to be in hduser’s authorized_keys).
- Delete the updates in the local file system: rm *.nt
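Putting the three steps together, lod_hdfs_update.sh roughly amounts to the following; a minimal sketch based on the commands above, without the error handling an actual script would need:
# copy the new N-Triples into HDFS
hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
# run the update job on the cluster
ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh"
# clean up the local copies
rm *.nt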
The job_updates.sh script calls the conversion and index scripts:
- Use the external LOD (for enrichment) and our updated LOD as the input: sh convert-lobid.sh extlod,update output/json-ld-update
- Index the resulting output in elasticsearch: sh index.sh output/json-ld-update
- Delete the updates in HDFS: hadoop fs -rm update/*
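Accordingly, the core of job_updates.sh corresponds to these three calls (a sketch based on the steps above; the actual script lives in lodmill-ld/doc/scripts and may do more):
# convert the external enrichment LOD plus the updates to JSON-LD
sh convert-lobid.sh extlod,update output/json-ld-update
# index the result in elasticsearch
sh index.sh output/json-ld-update
# remove the processed updates from HDFS
hadoop fs -rm update/*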