
Small High Availability Cluster Setup Guide

This is a guide to setting up Apache PredictionIO 0.11.0 in a 3 node cluster with all services running on the 3 cluster machines.

In this guide all services are set up with multiple or standby masters in true clustered mode. To make High Availability complete, a secondary master would need to be set up for HDFS (not described here). Elasticsearch and HBase are set up in High Availability (HA) mode using this guide.

Other Guides:

Requirements

In this guide, all servers share all services, except PredictionIO, which runs only on the master. Setup of multiple EventServers and PredictionServers is done with load-balancers and is out of the scope of this guide.

Here we'll install and set up:

  • Hadoop 2.7.2 (HDFS)
  • Spark 1.6.3
  • Elasticsearch 1.7.6
  • HBase 1.2.6
  • PredictionIO 0.11.0 and a template of your choice

Setup User, SSH, and host naming on All Hosts:

  1. Create a user for PredictionIO named aml on each server

    adduser aml # Give it some password
    
  2. Give the "aml" user sudo permissions and log in as the new user. This setup assumes the aml user is the owner of all services, including Spark and Hadoop (HDFS).

    usermod -a -G sudo aml
    sudo su - aml
    
  3. Set up passwordless ssh between all hosts of the cluster. This is a combination of adding all public keys to authorized_keys and making sure that known_hosts includes all cluster hosts, including each host to itself. There must be no prompt generated when any host tries to connect via ssh to any other host (a sketch of the commands follows this list). Note: The importance of this cannot be overstated! If ssh does not connect without requiring a password and without asking for confirmation, nothing else in the guide will work!

  4. Modify the /etc/hosts file and name each server. Don't use "localhost" or "127.0.0.1"; use either the LAN DNS name or a static IP address.

    10.0.0.1 some-master
    10.0.0.2 some-slave-1
    10.0.0.3 some-slave-2
    
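For example, assuming the hostnames from the /etc/hosts example above, passwordless ssh can be set up roughly like this (a sketch only, run as the aml user on every host):

    # generate a key with no passphrase (skip if one already exists)
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # copy the public key to every host in the cluster, including this one;
    # ssh-copy-id appends it to authorized_keys on the target host
    for h in some-master some-slave-1 some-slave-2; do ssh-copy-id aml@$h; done
    # verify: each of these must connect with no password and no prompt
    ssh some-master hostname
    ssh some-slave-1 hostname
    ssh some-slave-2 hostname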

Download Services on all Hosts

Download everything to a temp folder like /tmp/downloads; we will later move the packages to their final destinations. Note: You may need to install wget with yum or apt-get.

  1. Download the following (example commands are sketched below): Hadoop 2.7.2, Spark 1.6.3, Elasticsearch 1.7.6, HBase 1.2.6, and the PredictionIO 0.11.0 source.

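For example, assuming the versions used later in this guide, the tarballs can be fetched with something like the following. The URLs and file names are examples only; verify them against the Apache and Elastic archives before running:

    mkdir -p /tmp/downloads && cd /tmp/downloads
    # example URLs -- check the archive sites for the exact paths and file names
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    wget https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz
    wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.tar.gz
    wget https://archive.apache.org/dist/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
    wget https://archive.apache.org/dist/incubator/predictionio/0.11.0-incubating/apache-predictionio-0.11.0-incubating.tar.gz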
Setup Java 1.8

  1. Install Java OpenJDK or Oracle JDK for Java 7 or 8; the JRE alone is not sufficient.

    sudo apt-get install openjdk-8-jdk
    # for centos
    # sudo yum install java-1.8.0-openjdk
    
    • Note: on CentOS/RHEL you may need to install java-1.8.0-openjdk-devel if you get complaints about missing javac or javadoc.
  2. Check which versions of Java are installed and pick version 1.7 or greater.

    sudo update-alternatives --config java
    
  3. Set the JAVA_HOME environment variable. Don't include the /bin folder in the path. This can be problematic, so if you get complaints about JAVA_HOME you may need to change xxx-env.sh for whichever service complains; for instance, hbase-env.sh has a JAVA_HOME setting to use if HBase complains when starting.

    vim /etc/environment
    # add the following
    export JAVA_HOME=/path/to/open/jdk/jre
    # some would rather add JAVA_HOME to /home/aml/.bashrc
    

Install Services:

  1. Create folders in /opt

    sudo mkdir /opt/hadoop
    sudo mkdir /opt/spark
    sudo mkdir /opt/elasticsearch
    sudo mkdir /opt/hbase
    sudo chown aml:aml /opt/hadoop
    sudo chown aml:aml /opt/spark
    sudo chown aml:aml /opt/elasticsearch
    sudo chown aml:aml /opt/hbase
    
  2. Inside the /tmp/downloads folder, extract all downloaded services.

    tar -xvzf each-tar-filename
    
  3. Move extracted services to their folders.

    sudo mv /tmp/downloads/hadoop-2.7.2 /opt/hadoop/
    sudo mv /tmp/downloads/spark-1.6.3 /opt/spark/
    sudo mv /tmp/downloads/elasticsearch-1.7.6 /opt/elasticsearch/
    sudo mv /tmp/downloads/hbase-1.2.6 /opt/hbase/
    

    Note: Keep the version numbers in the folder names; if you upgrade or downgrade in the future, just create new symlinks.

  4. Symlink Folders

    sudo ln -s /opt/hadoop/hadoop-2.7.2 /usr/local/hadoop
    sudo ln -s /opt/spark/spark-1.6.3 /usr/local/spark
    sudo ln -s /opt/elasticsearch/elasticsearch-1.7.6 /usr/local/elasticsearch
    sudo ln -s /opt/hbase/hbase-1.2.6 /usr/local/hbase
    sudo ln -s /home/aml/pio /usr/local/pio
    

Setup Hadoop Cluster

Hadoop's Distributed File System (HDFS) is core to Spark and HBase, is used for staging data to be imported into the EventServer, and serves as storage for some templates. To become familiar with Hadoop installation, read this tutorial.

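A minimal sketch of the configuration, assuming the example hostnames above and the default HDFS port; the files live in /usr/local/hadoop/etc/hadoop and must be identical on every host:

    # /usr/local/hadoop/etc/hadoop/core-site.xml -- name the HDFS master
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://some-master:9000</value>
      </property>
    </configuration>

    # /usr/local/hadoop/etc/hadoop/hdfs-site.xml -- replicate blocks to 2 nodes
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>

    # /usr/local/hadoop/etc/hadoop/slaves -- one DataNode host per line
    some-master
    some-slave-1
    some-slave-2

    # then, on some-master only: format the NameNode and start HDFS
    /usr/local/hadoop/bin/hdfs namenode -format
    /usr/local/hadoop/sbin/start-dfs.sh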

Setup Spark Cluster

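A minimal sketch for a standalone Spark cluster with the master on some-master, assuming the configuration in /usr/local/spark/conf is copied to every host:

    # /usr/local/spark/conf/slaves -- one worker host per line
    some-master
    some-slave-1
    some-slave-2

    # /usr/local/spark/conf/spark-env.sh
    export SPARK_MASTER_IP=some-master   # SPARK_MASTER_HOST in newer Spark versions

    # then, on some-master only: start the master and all workers
    /usr/local/spark/sbin/start-all.sh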

Setup Elasticsearch Cluster

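A minimal sketch for Elasticsearch 1.7 with unicast discovery, assuming the example hostnames; the cluster.name below is a placeholder and must match whatever cluster name PredictionIO is configured to use. Edit the config on every host, then start Elasticsearch on each:

    # /usr/local/elasticsearch/config/elasticsearch.yml -- identical on every host
    cluster.name: some-cluster-name
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["some-master", "some-slave-1", "some-slave-2"]

    # start Elasticsearch as a daemon on each host
    /usr/local/elasticsearch/bin/elasticsearch -d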
Setup HBase Cluster

This tutorial is the best guide; many others produce incorrect results. The primary thing to remember is to install and configure on a single machine, adding all desired hostnames to backup-masters, regionservers, and the hbase.zookeeper.quorum config param, then copy all code and config to all other machines with something like scp -r ... Every machine will then be identical.

Configure with these changes to /usr/local/hbase/conf:

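The changes amount to pointing HBase at HDFS, enabling distributed mode, and listing every host in the ZooKeeper quorum. A sketch, assuming the example hostnames and the HDFS address used for Hadoop above:

    # /usr/local/hbase/conf/hbase-site.xml
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://some-master:9000/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>some-master,some-slave-1,some-slave-2</value>
      </property>
    </configuration>

    # /usr/local/hbase/conf/regionservers -- one RegionServer host per line
    some-master
    some-slave-1
    some-slave-2

    # /usr/local/hbase/conf/backup-masters -- standby HMaster hosts
    some-slave-1

    # /usr/local/hbase/conf/hbase-env.sh -- only if HBase complains about JAVA_HOME
    export JAVA_HOME=/path/to/open/jdk/jre

    # then, on some-master only:
    /usr/local/hbase/bin/start-hbase.sh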
At this point you should see several different processes start on the master and slaves including regionservers and zookeeper servers. If there is an error check the log files referenced in the error message. These log files may reside on any of the hosts as indicated in the file's name.
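One quick way to verify is to run jps (included with the JDK) on each host; with HBase healthy you would expect to see something like the following, though the exact mix of processes depends on which host you check:

    jps
    # typical processes on the master: HMaster, HQuorumPeer, HRegionServer
    # typical processes on the slaves: HRegionServer and, on quorum members, HQuorumPeer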

Note: It is strongly recommended to set up these files in the master's /usr/local/hbase folder and then copy all code and sub-folders to the slaves. All members of the cluster must have exactly the same code and config.

Setup PredictionIO

Set up PIO on the master or on all servers (if you plan to use a load balancer). The setup must not use install.sh, since you are using clustered services and that script only supports a standalone machine. See the Installing PredictionIO page for instructions.

PredictionIO is a source-only release, so you will need to build it.
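A sketch of the build, assuming the 0.11.0-incubating source tarball downloaded earlier and the /home/aml/pio location symlinked above (the exact steps and storage configuration are on the Installing PredictionIO page):

    # extract the source and build; the extracted folder name may differ
    tar -xvzf /tmp/downloads/apache-predictionio-0.11.0-incubating.tar.gz -C /home/aml
    mv /home/aml/apache-predictionio-0.11.0-incubating /home/aml/pio
    cd /home/aml/pio
    ./make-distribution.sh
    # then configure conf/pio-env.sh (from its template) to point at the
    # clustered HDFS, HBase, and Elasticsearch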

Setup Your Template

See the template setup instructions. The Universal Recommender can be installed with its quickstart.

Scaling for Load Balancers

See PredictionIO Load Balancing