All-In-One PIO Setup Guide

This is a guide to setting up Apache PredictionIO 0.11.0 on a single large-memory (16g-32g) machine. This will allow "real data" to be processed, but it is not appropriate where the focus is on horizontal scaling.

Requirements

In this guide, all services run on a single machine and so share cores and memory. This limits how much data can be processed and how much load can be handled, so the setup is advised for use as an experiment or development machine.

Here we'll install and setup: Java 8, Hadoop 2.7.2 (pseudo-distributed), Spark 1.6.3, Elasticsearch 1.7.6, HBase 1.2.6, and PredictionIO 0.11.0.

Setup User, SSH, and Host Naming:

  1. Create a user for PredictionIO named aml on the server

    adduser aml # Give it some password
    
  2. Give the "aml" user sudoers permissions and log in as the new user. This setup assumes the aml user is the owner of all services including Spark and Hadoop (HDFS).

    usermod -a -G sudo aml
    sudo su - aml
    
  3. Setup passwordless ssh. On this single-machine setup that means the host must be able to ssh to itself: add the host's public key to its own authorized_keys and make sure known_hosts includes the host. There must be no prompt generated when the host connects to itself via ssh. Note: The importance of this cannot be overstated! If ssh does not connect without requiring a password and without asking for confirmation, nothing else in the guide will work! A sketch follows this list.

  4. Modify the /etc/hosts file on this server. Don't use "localhost" or "127.0.0.1"; use either the LAN DNS name or a static IP address.

    10.0.0.1 some-master
    
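    As referenced in step 3, a minimal sketch of enabling passwordless ssh from this host to itself, assuming the aml user and no existing key:

    # as the aml user, generate a key with no passphrase
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # authorize the key for logins to this same host
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    # connect once to populate known_hosts; no password or confirmation prompt should appear
    ssh some-master exit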

Download Services on all Hosts

Download everything to a temp folder like /tmp/downloads; we will later move the services to their final destinations. Note: You may need to install wget with yum or apt-get.

  1. Download Hadoop 2.7.2, Spark 1.6.3, Elasticsearch 1.7.6, HBase 1.2.6, and the PredictionIO 0.11.0 source, as sketched below.
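    A sketch of fetching the versions used in this guide. The URLs assume the Apache archive and Elastic download layouts; verify them before use:

    cd /tmp/downloads
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    # note: this Spark tarball extracts to spark-1.6.3-bin-hadoop2.6; the paths later
    # in this guide use spark-1.6.3, so adjust folder names accordingly
    wget https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz
    wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.tar.gz
    wget https://archive.apache.org/dist/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
    wget https://archive.apache.org/dist/incubator/predictionio/0.11.0-incubating/apache-predictionio-0.11.0-incubating.tar.gz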

Setup Java 1.8

  1. Install OpenJDK or the Oracle JDK for Java 7 or 8; the JRE alone is not sufficient.

    sudo apt-get install openjdk-8-jdk
    # for centos
    # sudo yum install java-1.8.0-openjdk
    
    • Note: on Centos/RHEL you may need to install java-1.8.0-openjdk-devel if you get complaints about missing javac or javadoc
  2. Check which versions of Java are installed and pick 1.7 or greater.

    sudo update-alternatives --config java
    
  3. Set the JAVA_HOME env var. Don't include the /bin folder in the path. If a service complains about JAVA_HOME at startup, you may also need to set it in that service's xxx-env.sh; for instance hbase-env.sh has a JAVA_HOME setting to change if HBase complains when starting.

    vim /etc/environment
    # add the following; /etc/environment takes plain KEY=value pairs (no "export")
    JAVA_HOME=/path/to/open/jdk/jre
    # some would rather add "export JAVA_HOME=..." to /home/aml/.bashrc
    

Install Services:

  1. Create folders in /opt

    sudo mkdir /opt/hadoop
    sudo mkdir /opt/spark
    sudo mkdir /opt/elasticsearch
    sudo mkdir /opt/hbase
    sudo chown aml:aml /opt/hadoop
    sudo chown aml:aml /opt/spark
    sudo chown aml:aml /opt/elasticsearch
    sudo chown aml:aml /opt/hbase
    
  2. Inside the /tmp/downloads folder, extract all downloaded services.

    tar -xvzf each-tar-filename
    
  3. Move extracted services to their folders.

    sudo mv /tmp/downloads/hadoop-2.7.2 /opt/hadoop/
    sudo mv /tmp/downloads/spark-1.6.3 /opt/spark/
    sudo mv /tmp/downloads/elasticsearch-1.7.6 /opt/elasticsearch/
    sudo mv /tmp/downloads/hbase-1.2.6 /opt/hbase/
    

    Note: Keep the version numbers in the folder names; if you upgrade or downgrade in the future, just create new symlinks.

  4. Symlink Folders

    sudo ln -s /opt/hadoop/hadoop-2.7.2 /usr/local/hadoop
    sudo ln -s /opt/spark/spark-1.6.3 /usr/local/spark
    sudo ln -s /opt/elasticsearch/elasticsearch-1.7.6 /usr/local/elasticsearch
    sudo ln -s /opt/hbase/hbase-1.2.6 /usr/local/hbase
    sudo ln -s /home/aml/pio /usr/local/pio
    

Setup Hadoop Pseudo-Distributed Mode

Read the Apache Hadoop single-node setup tutorial, especially the Pseudo-Distributed Mode section. Wherever the Hadoop configuration asks for a hostname, use the name from /etc/hosts rather than localhost:

```
some-master
```
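A minimal sketch of the standard pseudo-distributed settings, using the paths and hostname above; port 9000 is the common default and can be changed:

```
<!-- /usr/local/hadoop/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://some-master:9000</value>
  </property>
</configuration>

<!-- /usr/local/hadoop/etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- one copy of each block; this is a single machine -->
  </property>
</configuration>
```

Then format HDFS and start it:

```
/usr/local/hadoop/bin/hdfs namenode -format
/usr/local/hadoop/sbin/start-dfs.sh
```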

Setup Spark Cluster
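A minimal sketch of bringing up a standalone Spark master and one worker on this host, using the scripts shipped with Spark 1.6 and its default standalone port 7077:

```
# list this host as a worker
echo some-master >> /usr/local/spark/conf/slaves

/usr/local/spark/sbin/start-master.sh
# start a worker attached to the master
/usr/local/spark/sbin/start-slave.sh spark://some-master:7077
```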

Setup Elasticsearch Cluster

Change these settings in /usr/local/elasticsearch/config/elasticsearch.yml so the nodes find each other by unicast:

```
cluster.name: some-cluster-name
discovery.zen.ping.multicast.enabled: false # most cloud services don't allow multicast
discovery.zen.ping.unicast.hosts: ["some-master"] # add all hosts, masters and/or data nodes
```
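A sketch of launching Elasticsearch as a daemon and checking that it responds on its default port 9200:

```
/usr/local/elasticsearch/bin/elasticsearch -d
curl http://some-master:9200
```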

Setup HBase

This tutorial is the best guide; many others produce incorrect results. We are using one host in this guide, so no multi-host copying is needed, but this setup most closely resembles a clustered setup.

Configure with these changes to /usr/local/hbase/conf:
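As a sketch, the key settings point HBase at the HDFS address used above and let HBase manage its own ZooKeeper on this host; the property names are standard HBase settings:

```
<!-- /usr/local/hbase/conf/hbase-site.xml -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://some-master:9000/hbase</value> <!-- HBase data lives in HDFS -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>some-master</value>
  </property>
</configuration>
```

Also list this host in /usr/local/hbase/conf/regionservers, then start HBase:

```
/usr/local/hbase/bin/start-hbase.sh
```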

Setup PredictionIO

PredictionIO 0.11.0 is a source-only release, so you will need to build it.
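A sketch of building from the 0.11.0-incubating source tarball, assuming it was downloaded to /tmp/downloads; the /home/aml/pio location matches the symlink created earlier:

```
cd /home/aml
tar -xvzf /tmp/downloads/apache-predictionio-0.11.0-incubating.tar.gz
mv apache-predictionio-0.11.0-incubating pio
cd pio
# build the binary distribution with the bundled script
# (the first run downloads sbt and its dependencies)
./make-distribution.sh
```

After the build, point conf/pio-env.sh at the services configured above, start them, and verify with bin/pio status.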

Setup Your Template

See the template setup instructions. The Universal Recommender can be installed with its quickstart.
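As a sketch of the usual template workflow once PIO is running (the app name here is hypothetical; the subcommands are the standard pio CLI):

```
# create an app to receive events
pio app new my-app
# then, from inside the template's directory:
pio build
pio train
pio deploy
```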