PredictionIO Standalone Server Guide: The Driver Machine

This is a guide to setting up the PredictionIO model training machine for templates like the Universal Recommender (UR), which use Spark only for pio train. For the UR there is no need to run this on more than one machine, since the input data, the model (created by pio train), and queries all use shared services. This means the Spark "driver" (pio train) can run on a temporary machine that is created, trained on, then destroyed along with a temporary Spark cluster. This has no effect on the other parts of the system that ingest data and return query results. At the end of this guide we will spin up a Spark cluster and offload the majority of the training work to the cluster, then take it offline so it costs nothing while idle.

This machine will run no services itself. It expects to connect to external HDFS, HBase, and Spark and to run pio train to create a model, which is stored on some shared service (Elasticsearch in the case of the UR).

Focus on this part of the standard PIO workflow.



Create an instance on AWS or another cloud provider, and make sure the machine has enough memory to run the training part of pio. For the UR the requirement varies greatly, starting from a minimum of 16 GB. This will be something like an r3.xlarge or r3.2xlarge. The machine should match the Spark Executor machines in memory size, since the driver and executors need roughly the same amount.

Prep the Machine

Read and follow the Small HA Cluster instructions, but note that we need the installation jars only for configuration information, scripts, or client launcher code (in the case of Spark). Do not start any service on this machine except the EventServer and PredictionServer! This is very important. All services are expected to be already running on other machines.

Configure PIO

Edit /usr/local/pio/conf/ and replace the contents with the following, making sure to update all server IP addresses to match the masters of the already running remote clusters of Elasticsearch, HDFS, HBase, and Spark:

# Safe config that will work if you expand your cluster later
# Filesystem paths that PredictionIO uses as block storage.
# PredictionIO Storage Configuration
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
# Storage Repositories
# Default is to use PostgreSQL but for clustered scalable setup we'll use
# Elasticsearch
# Need to use HDFS here instead of LOCALFS to enable deploying to 
# machines without the local model
# Storage Data Sources, lower level than the repos above, just a simple storage API
# to use
# Elasticsearch Example
# The next line should match the ES in ES config
# For clustered Elasticsearch (use one host/port if not clustered)
# model storage, required to be in hdfs but not really used
# HBase Source config
# Hbase clustered config (use one host/port if not clustered)
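
Only the comments of the configuration survive above, so here is a minimal sketch of what the settings under them could look like. It assumes the standard PredictionIO environment variable names; every host name (some-master, some-slave-1), install path under /usr/local, cluster name, and port is a placeholder assumption rather than a value from this guide. Replace them with the addresses of your own running clusters and check them against your PIO version.

```shell
# Sketch only: all hosts, paths, names, and ports below are placeholders (assumptions).

# Locations of client jars and config for the remote services (assumed install paths)
SPARK_HOME=/usr/local/spark
ES_CONF_DIR=/usr/local/elasticsearch/config
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
HBASE_CONF_DIR=/usr/local/hbase/conf

# Filesystem paths that PredictionIO uses as block storage
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# Storage Repositories: metadata in Elasticsearch, events in HBase, models in HDFS
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
# HDFS instead of LOCALFS so other machines can deploy without a local model
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# Elasticsearch source; the cluster name should match the one in the ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=some_cluster_name
# Clustered Elasticsearch (use one host/port if not clustered); ports are assumed
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=some-master,some-slave-1
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,9300

# Model storage, required to be in HDFS but not really used (URI is assumed)
PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://some-master:9000/models

# HBase source; clustered config (use one host/port if not clustered)
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase
PIO_STORAGE_SOURCES_HBASE_HOSTS=some-master,some-slave-1
# Placeholder ports (assumption); verify what your PIO version expects here
PIO_STORAGE_SOURCES_HBASE_PORTS=0,0
```

After editing, running pio status on this machine is a reasonable way to confirm that the remote stores are reachable before attempting a training run.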

Setup Your Template

See the template setup instructions. The Universal Recommender can be installed with its quickstart.

Temporary Spark

In the diagram at the top of the page, note that if your template, like the Universal Recommender, does not need Spark for serving queries, Spark can be shut down except while pio train is running. This can be done by creating and configuring Spark, then using something like AWS "change instance state" to stop this driver machine and all Spark cluster machines. There are AWS APIs for doing this in an automated fashion, which we use with Docker and Terraform in our Ops automation tools. Contact ActionML for more information about these tools.
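
As a rough illustration of the "change instance state" idea, the sketch below uses the AWS CLI to start the training machines, wait for them, and stop them again. It is not the ActionML automation tooling mentioned above. The instance IDs are hypothetical placeholders, and the run helper and DRY_RUN variable are names invented for this example; by default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical IDs for the driver machine and the Spark cluster machines (assumptions).
TRAINING_INSTANCE_IDS="${TRAINING_INSTANCE_IDS:-i-0123456789abcdef0 i-0fedcba9876543210}"

# Print the command when DRY_RUN=1 (the default), execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Bring the stopped machines up and block until EC2 reports them running.
# The unquoted variable is intentional so each ID becomes its own argument.
run aws ec2 start-instances --instance-ids $TRAINING_INSTANCE_IDS
run aws ec2 wait instance-running --instance-ids $TRAINING_INSTANCE_IDS

# ... start the Spark daemons and run pio train here ...

# Stop the machines again so they do not accrue compute charges while idle.
run aws ec2 stop-instances --instance-ids $TRAINING_INSTANCE_IDS
```

Stopped instances still keep their disks, so the Spark and PIO installation survives between training runs; the Spark daemons themselves will typically need to be started again after the machines come back up.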

Once Spark is instantiated and running, you may execute your template's pio train command on this machine, and run pio deploy on some other machine that will host the permanent PredictionServer and EventServer.
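
A training invocation against the remote cluster might look like the sketch below. It relies on pio train passing everything after the bare "--" through to spark-submit. The master host name is a placeholder assumption, 7077 is the default port of a standalone Spark master, and the 16g figure simply mirrors the minimum mentioned earlier, with driver and executor memory kept equal. SPARK_MASTER_HOST, TRAIN_MEMORY, and DRY_RUN are names invented for this example; the script prints the command unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Placeholder master host (assumption); replace with your Spark master's address.
SPARK_MASTER_HOST="${SPARK_MASTER_HOST:-some-spark-master}"
# Same memory for driver and executors, following the sizing advice in this guide.
TRAIN_MEMORY="${TRAIN_MEMORY:-16g}"

# Arguments after the bare "--" are handed to spark-submit by pio.
TRAIN_CMD="pio train -- --master spark://${SPARK_MASTER_HOST}:7077 --driver-memory ${TRAIN_MEMORY} --executor-memory ${TRAIN_MEMORY}"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Show what would be run without requiring pio on this machine.
  echo "$TRAIN_CMD"
else
  # Run from inside the template's directory; word splitting here is intentional.
  $TRAIN_CMD
fi
```

Tune the memory values to your data size; the right numbers depend on the template and the volume of events being trained on.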