Spark is a fast, in-memory, distributed compute engine. By in-memory we mean that it does not keep intermediate results in files, as Hadoop's MapReduce engine does. PredictionIO supports all modes of running Spark, including the YARN modes.
The key modes you'll need to understand in the beginning are local mode and standalone cluster mode. It is common to use PredictionIO in local mode for tasks like pio import ... or pio export ..., and standalone cluster mode for training. This is not a rule, so you can choose either. Some configuration options will make more sense once we describe the terminology and how it is used with Spark.
Various configuration parameters allow us to control all aspects of how cores, memory, and physical machines are partitioned to match the needs of the algorithm.
Spark has a great many configuration parameters. By default master=local[n], where n is the number of cores to allocate, creates one process with one thread per allocated core.
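As a sketch, forcing four local cores for a PredictionIO command might look like the following (the core count is a placeholder; pick a value that fits your machine):

```shell
# hypothetical example: run training locally with 4 cores
pio train -- --master local[4]
```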
Spark uses a cluster of Workers and a Master for large horizontally distributed calculations. These are still, by default, in-memory but the memory can be spread among Executors in the cluster.
First, a note about memory, the most common configuration and setup question for Spark. If large in-memory structures such as hashmaps are created and used by all Executors, as with the Universal Recommender, then there must be enough memory to hold the hashmap on each Executor. Think of this as a base requirement on top of which you need to account for transient memory needs.
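As an illustrative sizing (the figures here are hypothetical, not measured): if the shared hashmap needs roughly 6 gigabytes per Executor, budget that as the base plus headroom for transient allocations:

```shell
# hypothetical: ~6g base structure + ~2g transient headroom per Executor
pio train -- --executor-memory 8g
```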
There are many ways to tune how Spark runs but with PIO you'll notice these first:
End the command with "--" and follow with parameters passed directly to SparkSubmit. These fill in the SparkConf available to all distributed Spark code and also control how the resources of the cluster are allocated. Using an external machine implies a standalone cluster and looks like -- --master=spark://some-ip:7077. All params after the "--" will be treated as SparkSubmit params.
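Putting the pass-through together, a sketch of a training run against a standalone cluster (the master address and memory figures are placeholders for your own setup):

```shell
pio train -- --master spark://some-ip:7077 --driver-memory 4g --executor-memory 16g
```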
--master=xxxx: This tells SparkSubmit where to look for the Master and therefore what mode Spark runs in. By default in PIO this is master=local, for local use of all available cores.
--driver-memory xg: This tells Spark to give "x" gigabytes of memory to the Driver. Since it is passed to SparkSubmit, it only affects the part of the Driver controlled by Spark. In rare cases you may need to give more memory to the JVM that launches the Driver; that is not covered here.
--executor-memory xg: This tells Spark to give "x" gigabytes of memory to each Executor. In a clustered setting where an entire machine is dedicated to being a Worker with one Executor, you would give it almost all of the machine's memory, leaving only enough for the OS to run. When running on a single machine, take great care not to over-allocate memory for the Driver and Executors; allocating more memory than you have will over-constrain the machine and cause constant crashes with some form of out-of-memory exception.
Spark is fast but gobbles memory, so don't scrimp, and be careful running the Driver and an Executor on the same machine, which will need roughly double the memory of running them on two different machines.
In-memory but distributed computation means that (all other things being equal) if you have 1 TB of data, the cluster as a whole may need 1 TB of memory.
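A back-of-envelope sketch of that sizing rule; the Executor count and overhead percentage below are assumptions for illustration, not figures from this guide:

```shell
# Rough cluster sizing: spread 1 TB of data across 8 Executors,
# then pad each Executor's share by ~20% for JVM and transient overhead.
DATA_GB=1024      # total in-memory data (1 TB)
EXECUTORS=8       # hypothetical cluster size
OVERHEAD_PCT=20   # hypothetical headroom for transient allocations

PER_EXECUTOR_GB=$(( DATA_GB / EXECUTORS ))
WITH_OVERHEAD_GB=$(( PER_EXECUTOR_GB * (100 + OVERHEAD_PCT) / 100 ))

echo "Base share per Executor: ${PER_EXECUTOR_GB}g"
echo "Suggested --executor-memory: ${WITH_OVERHEAD_GB}g"
```

Adjust the Executor count and overhead to match your cluster; the point is that the base data share and the transient headroom must both fit on every Executor.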