This file documents the configuration we used to set up our small Hadoop (5 nodes) and Elasticsearch (3 nodes) clusters. All machines run Ubuntu 12.04.2 LTS server edition (‘precise’) and Java 7. We use Elasticsearch 1.1.0 and Hadoop 2.3.0, and Puppet 2.7.22 to set up the Hadoop nodes.
We used Puppet (see http://puppetlabs.com/puppet/what-is-puppet) to install Hadoop on the nodes. First, Puppet has to be installed on all machines. To get a current version of Puppet, add the Puppet repositories (see http://docs.puppetlabs.com/guides/puppetlabs_package_repositories.html#for-debian-and-ubuntu) as described for the master and the agents below.
On the master, add the Puppet repository and install the master and agent packages:
wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
sudo dpkg -i puppetlabs-release-precise.deb
sudo apt-get update
sudo apt-get install puppetmaster=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1 puppetmaster-common=2.7.22-1puppetlabs1 puppet=2.7.22-1puppetlabs1
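To check that the pinned version was actually installed:
puppet --version   # should print 2.7.22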
cp -R puppet /etc/puppet
cd /etc/puppet/modules
git clone https://github.com/fsteeg/puppet-java.git java
git clone https://github.com/fsteeg/puppet-hadoop.git hadoop
ssh-keygen ; cp -R ~/.ssh /etc/puppet/modules/hadoop/ssh
cd /etc/puppet/modules/java/files
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-x64.tar.gz"
cd /etc/puppet/modules/hadoop/files
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz
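Both archives are fetched over plain HTTP, so it is worth making sure they arrived intact before Puppet tries to unpack them; listing their contents is a quick sanity check:
tar -tzf /etc/puppet/modules/java/files/jdk-7u21-linux-x64.tar.gz > /dev/null && echo "JDK archive OK"
tar -tzf /etc/puppet/modules/hadoop/files/hadoop-2.3.0.tar.gz > /dev/null && echo "Hadoop archive OK"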
- remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file
- configure the puppet agents in puppet/manifests/nodes.pp
- configure the Java version to use in /etc/puppet/modules/java/manifests/params.pp
- configure the cluster’s Hadoop version, master, and slave nodes in /etc/puppet/modules/hadoop/manifests/params.pp (use the hostnames, not the aliases, i.e. the first value after the IP in /etc/hosts, not the second, as in the example line after this list; don’t include the master in the slave nodes)
- start the puppet master:
sudo puppet master --mkusers
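For illustration, an /etc/hosts entry in this setup looks like the following (the IP and the short alias are placeholders); the Puppet params files need the first name after the IP, not the alias:
192.168.1.11   weywot1.hbz-nrw.de   weywot1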
On the agents, likewise add the Puppet repository and install the agent packages:
wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
sudo dpkg -i puppetlabs-release-precise.deb
sudo apt-get update
sudo apt-get install puppet=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1
- remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file
- from each node (master and clients), request a certificate from the master (weywot1.hbz-nrw.de is the master):
puppet agent --test --server weywot1.hbz-nrw.de
- if this fails with certificate issues, run
puppet cert clean weywot1.hbz-nrw.de
on the master and
rm -rf /var/lib/puppet/ssl
on the clients
- on the master, sign the certificates:
sudo puppet cert -s weywot1.hbz-nrw.de weywot2.hbz-nrw.de weywot3.hbz-nrw.de weywot4.hbz-nrw.de weywot5.hbz-nrw.de
- from each node (master and clients), set everything up:
puppet agent --test --server weywot1.hbz-nrw.de
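If anything goes wrong with the certificate exchange, the state of all certificates can be inspected on the master (signed certificates are prefixed with a +):
sudo puppet cert --list --all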
- become the Hadoop user:
sudo su hduser
- the automated formatting of a new HDFS currently fails, see http://projects.puppetlabs.com/issues/5876, so we do it manually on the master and the clients:
bin/hadoop namenode -format
- start all daemons:
cd /opt/hadoop/hadoop; sh bin/start-all.sh
- visit the admin UIs at http://<master>:50070/ and http://<master>:50030/ to check if the cluster is up (for a command-line check, see below)
- copy the sample job files from the sample directory to /opt/hadoop/hadoop and run the job:
sh temperature-sample.sh
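The command-line check mentioned above: the HDFS report lists the DataNodes that have registered with the NameNode (run as hduser from /opt/hadoop/hadoop):
bin/hadoop dfsadmin -report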
After everything is set up, whenever the Puppet scripts are changed, run the following on all machines to update the setup:
puppet agent --test --server weywot1.hbz-nrw.de
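Since this has to happen on every node, a small loop can save some typing; a sketch, assuming the machine you run it from has SSH access to all nodes and the remote user can run the puppet agent as in the commands above:
for host in weywot1 weywot2 weywot3 weywot4 weywot5; do
  ssh $host.hbz-nrw.de 'puppet agent --test --server weywot1.hbz-nrw.de'
done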
For the Elasticsearch nodes, install Java 7 and the Elasticsearch package:
sudo apt-get install java7-jdk
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.deb
sudo dpkg -i elasticsearch-1.1.0.deb
- set cluster.name: quaoar in /etc/elasticsearch/elasticsearch.yml
- tweak memory settings in /etc/init.d/elasticsearch (we use ES_HEAP_SIZE=64g)
sudo service elasticsearch restart
curl http://localhost:9200
sudo /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head
curl http://localhost:9200/_plugin/head/
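Once Elasticsearch has been restarted on all three nodes, the cluster health API shows whether they have actually formed the quaoar cluster; number_of_nodes should be 3:
curl http://localhost:9200/_cluster/health?pretty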
Our current workflow for automatic updates:
- Transform the union catalogue to RDF in N-Triples serialization and trigger the update (lod_hdfs_update.sh; see the sketch after this list):
- Copy N-Triples from local file system to HDFS: hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
- SSH into the Hadoop cluster and start the updates script: ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh" (this requires a correct SSH key setup: the public key of the user logging in has to be in hduser’s authorized_keys).
- Delete the updates in the local file system: rm *.nt
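Putting the three steps together, lod_hdfs_update.sh roughly amounts to the following; a minimal sketch based on the commands above, without the error handling an actual script would need:
# copy the new N-Triples into HDFS
hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
# run the update job on the cluster
ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh"
# clean up the local copies
rm *.nt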
The job_updates.sh script calls the conversion and index scripts:
- Use the external LOD (for enrichment) and our updated LOD as the input: sh convert-lobid.sh extlod,update output/json-ld-update
- Index the resulting output in elasticsearch: sh index.sh output/json-ld-update
- Delete the updates in HDFS: hadoop fs -rm update/*
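Accordingly, the core of job_updates.sh corresponds to these three calls (a sketch based on the steps above; the actual script lives in lodmill-ld/doc/scripts and may do more):
# convert the external enrichment LOD plus the updates to JSON-LD
sh convert-lobid.sh extlod,update output/json-ld-update
# index the result in elasticsearch
sh index.sh output/json-ld-update
# remove the processed updates from HDFS
hadoop fs -rm update/*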