Overview

This file documents the configuration we used to set up our small Hadoop (5 nodes) and Elasticsearch (3 nodes) clusters. All machines run Ubuntu 12.04.2 LTS server edition (‘precise’) and Java 7. We’re running Elasticsearch 1.1.0 and Hadoop 2.3.0, and we use Puppet 2.7.22 to set up the Hadoop nodes.

Hadoop and Puppet

We used Puppet to install Hadoop on the nodes (see http://puppetlabs.com/puppet/what-is-puppet). First, install Puppet on all machines. To get a current version of Puppet, add the Puppet repositories (see http://docs.puppetlabs.com/guides/puppetlabs_package_repositories.html#for-debian-and-ubuntu), as described for the master and the agents below.

Master

  • wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
  • sudo dpkg -i puppetlabs-release-precise.deb
  • sudo apt-get update
  • sudo apt-get install puppetmaster=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1 puppetmaster-common=2.7.22-1puppetlabs1 puppet=2.7.22-1puppetlabs1
  • cp -R puppet /etc/puppet
  • cd /etc/puppet/modules
  • git clone https://github.com/fsteeg/puppet-java.git java
  • git clone https://github.com/fsteeg/puppet-hadoop.git hadoop
  • ssh-keygen ; cp -r ~/.ssh /etc/puppet/modules/hadoop/ssh
  • cd /etc/puppet/modules/java/files
  • wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-x64.tar.gz"
  • cd /etc/puppet/modules/hadoop/files
  • wget http://archive.apache.org/dist/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz
  • remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file
  • configure the puppet agents in puppet/manifests/nodes.pp
  • configure the Java version to use in /etc/puppet/modules/java/manifests/params.pp
  • configure the cluster’s Hadoop version, master, and slave nodes in /etc/puppet/modules/hadoop/manifests/params.pp (use the hostnames, not the aliases, i.e. the first value after the IP in /etc/hosts, not the second, as in the example hosts file after this list; don’t include the master in the slave nodes)
  • start the puppet master: sudo puppet master --mkusers
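
The hosts file layout matters here: Puppet and Hadoop are configured with the hostnames, i.e. the first name after the IP address in /etc/hosts. A minimal sketch of the hosts file on each node (the IP addresses are placeholders; the localhost mappings are removed as described above):

    # /etc/hosts on every node (127.x.x.x localhost mappings removed/commented out)
    # Use the first name after each IP (the hostname) in the Puppet and Hadoop
    # configuration; the second name is the alias.
    10.0.0.1   weywot1.hbz-nrw.de   weywot1
    10.0.0.2   weywot2.hbz-nrw.de   weywot2
    10.0.0.3   weywot3.hbz-nrw.de   weywot3
    10.0.0.4   weywot4.hbz-nrw.de   weywot4
    10.0.0.5   weywot5.hbz-nrw.de   weywot5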

Agents

  • wget http://apt.puppetlabs.com/puppetlabs-release-precise.deb
  • sudo dpkg -i puppetlabs-release-precise.deb
  • sudo apt-get update
  • sudo apt-get install puppet=2.7.22-1puppetlabs1 puppet-common=2.7.22-1puppetlabs1
  • remove/comment the localhost mappings (127.x.x.x) from the /etc/hosts file

Both

  • from each node (master and clients), request a certificate from the master: puppet agent --test --server weywot1.hbz-nrw.de (weywot1.hbz-nrw.de is the master)
  • if this fails with certificate issues, run puppet cert clean weywot1.hbz-nrw.de on the master and rm -rf /var/lib/puppet/ssl on the clients
  • on the master, sign the certificates: sudo puppet cert -s weywot1.hbz-nrw.de weywot2.hbz-nrw.de weywot3.hbz-nrw.de weywot4.hbz-nrw.de weywot5.hbz-nrw.de
  • from each node (master and clients), set everything up: puppet agent --test --server weywot1.hbz-nrw.de
  • become the Hadoop user: sudo su hduser
  • the automated formatting of a new HDFS currently fails, see http://projects.puppetlabs.com/issues/5876, so we do it manually on the master and the clients: bin/hadoop namenode -format
  • start all daemons: cd /opt/hadoop/hadoop; sh bin/start-all.sh
  • visit the admin UIs at http://<master>:50070/ and http://<master>:50030/ to check if the cluster is up (a scripted check follows this list)
  • copy the sample job files from the sample directory to /opt/hadoop/hadoop and run the sample: sh temperature-sample.sh
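
To script the cluster check from the steps above, something along these lines can be used (a sketch, assuming the master hostname and the web UI ports mentioned above):

    # Quick smoke test, run as hduser on the master after start-all.sh.
    MASTER=weywot1.hbz-nrw.de
    # Both requests should return HTTP 200 if the admin UIs are up.
    curl -s -o /dev/null -w "NameNode UI:   %{http_code}\n" http://$MASTER:50070/
    curl -s -o /dev/null -w "JobTracker UI: %{http_code}\n" http://$MASTER:50030/
    # List the HDFS root to confirm the freshly formatted filesystem is reachable.
    /opt/hadoop/hadoop/bin/hadoop fs -ls /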

After the initial setup, whenever the Puppet scripts change, run puppet agent --test --server weywot1.hbz-nrw.de on all machines to apply the changes.

Elasticsearch

Install Elasticsearch

  • sudo apt-get install java7-jdk
  • wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.deb
  • sudo dpkg -i elasticsearch-1.1.0.deb
  • set cluster.name: quaoar in /etc/elasticsearch/elasticsearch.yml
  • tweak memory settings in /etc/init.d/elasticsearch (we use ES_HEAP_SIZE=64g)
  • sudo service elasticsearch restart
  • curl http://localhost:9200 (a cluster-wide health check follows this list)
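
To verify that a node joined the right cluster and that all three Elasticsearch nodes are present, the cluster health API can be queried (a sketch; the expected values follow from the configuration above):

    # Should report "cluster_name" : "quaoar" and "number_of_nodes" : 3;
    # "status" should be "green" once all shards are allocated.
    curl http://localhost:9200/_cluster/health?pretty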

Install the elasticsearch-head plugin

  • sudo /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head
  • curl http://localhost:9200/_plugin/head/

Workflow for updates

Our current workflow for automatic updates:

- Transform the union catalogue to RDF in N-Triples serialization and trigger the update with lod_hdfs_update.sh (a sketch of this script follows the list):
- Copy N-Triples from local file system to HDFS: hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
- SSH into the Hadoop cluster and start the update script: ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh" (this requires a correct SSH key setup: the public key of the user logging in should be in hduser's authorized_keys).
- Delete the updates in the local file system: rm *.nt
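
A sketch of lod_hdfs_update.sh based on the three steps above (the actual script in the repository may differ in paths and error handling):

    #!/bin/bash
    # lod_hdfs_update.sh (sketch): push fresh N-Triples to HDFS, run the update
    # job on the cluster, then remove the local copies.
    set -e
    # 1. copy the N-Triples from the local file system to the HDFS update directory
    hadoop fs -copyFromLocal *.nt hdfs://weywot1.hbz-nrw.de:8020/user/hduser/update/
    # 2. run the update job on the cluster (requires the caller's public key in hduser's authorized_keys)
    ssh hduser@weywot1.hbz-nrw.de "cd /home/sol/git/lodmill/lodmill-ld/doc/scripts; bash -x job_updates.sh"
    # 3. delete the updates in the local file system
    rm *.nt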

The job_updates.sh script calls the conversion and index scripts (a sketch follows the list):

- Use the external LOD (for enrichment) and our updated LOD as the input: sh convert-lobid.sh extlod,update output/json-ld-update
- Index the resulting output in Elasticsearch: sh index.sh output/json-ld-update
- Delete the updates in HDFS: hadoop fs -rm update/*
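
Put together, job_updates.sh looks roughly like this (a sketch; the actual script may differ):

    #!/bin/bash
    # job_updates.sh (sketch): convert the external LOD plus the updates to
    # JSON-LD, index the result, and clean up the update directory in HDFS.
    set -e
    # use the external LOD (for enrichment) and the updated LOD as input
    sh convert-lobid.sh extlod,update output/json-ld-update
    # index the resulting output in Elasticsearch
    sh index.sh output/json-ld-update
    # delete the updates in HDFS
    hadoop fs -rm update/*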