Action ML Logo

Table of contents

Introduction to Spark

Spark is a fast in-memory distributed compute engine. By in-memory we mean it does not keep intermediate results in file as with Hadoop's Mapreduce Engine. PredictionIO supports all modes of running Spark including the "Yarn modes".

The key modes you'll need to understand in the beginning are:

It is common to use PredictionIO in local mode to do things like pio import ... or pio export .... Standalone Cluster mode is used for training. This is not a rule and so you can choose. Some configuration options will make more sense if we describe the terminology and use with Spark.

Spark has:

Various configuration parameters allow us to control all aspects of how cores, memory, and physical machines are partitioned to match the needs of the algorithm.

Local Mode Spark

Spark has a great many configuration parameters, by default master=local[n], where n = the number of cores to be allocated will create one process with threads allocated per core in the following manner

image

Spark Standalone Cluster Mode

Spark uses a cluster of Workers and a Master for large horizontally distributed calculations. These are still, by default, in-memory but the memory can be spread among Executors in the cluster.

image

Spark Parameters

First a note about memory, the most common config and setup question for Spark

There are many ways to tune how Spark runs but with PIO you'll notice these first:

Caveates

Spark is fast but gobbles memory so don't scrimp and be careful running driver and executor on the same machine, which will need roughly double the memory of running them on 2 different machines.

In-memory but distributed computation means that (all other things being equal) if you have 1 TB of data, then the cluster as a whole may need 1TB of memory.