Configuration

Engines in Harness follow a pattern that defines defaults for many parameters then allows you to override them in the Engine's JSON config. If further refinement makes sense it is done in the Query.

For instance, the default number of results returned is 20. This can be overridden in the UR config JSON, which in turn can be overridden in any query.

Business Rules can also be specified in the Engine's config or in the query. For example, a rule might include only items where "available": "true". Set in the config, it is applied to every query unless the Query overrides it or adds new rules.
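The layering described above can be sketched in Python. This is only an illustration of the precedence order (built-in default, then Engine config, then Query), not the UR's actual code. The function name `resolve_params` and the dictionary layout are assumptions made for the example, and the sketch models only the case where a Query replaces the config's rules; how the UR combines rules from both places is not specified here.

```python
# Hypothetical sketch of the override order: defaults < engine config < query.
UR_DEFAULTS = {"num": 20, "rules": []}  # "num": 20 is the documented default


def resolve_params(engine_config: dict, query: dict) -> dict:
    """Return effective parameters; later layers win over earlier ones."""
    algorithm = engine_config.get("algorithm", {})
    effective = dict(UR_DEFAULTS)
    for layer in (algorithm, query):
        for key in UR_DEFAULTS:
            if key in layer:
                effective[key] = layer[key]
    return effective


engine_config = {
    "engineId": "ecom_ur",
    "algorithm": {
        "num": 10,
        # Config-level rule used when the query sets no "rules" of its own.
        # The bias value is illustrative only.
        "rules": [{"name": "available", "values": ["true"], "bias": -1}],
    },
}

# Query without overrides: the config values apply.
print(resolve_params(engine_config, {"user": "u-1"}))
# Query that overrides "num" only: the config rule still applies.
print(resolve_params(engine_config, {"user": "u-1", "num": 4}))
```

With an empty config and an empty query the sketch falls back to the default of 20 results.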

Configuration Sections

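The UR Configuration is written in Harness JSON (JSON extended to allow substitution of values with data from environment variables) and divided into sections for engine-level settings (such as "engineId" and "engineFactory"), the dataset ("dataset"), Spark ("sparkConf"), and the algorithm ("algorithm").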

Simplest UR Configuration

Imagine an ECom version of the UR that only watches for "buys" and product detail "views". To be sure, there are many other ways to use a recommender, but this is a good, simple example.

We will make heavy use of default settings that have been chosen in the Universal Recommender code and only set required config and parameters.

{
    "engineId": "ecom_ur",
    "engineFactory": "com.actionml.engines.ur.UREngine",
    "sparkConf": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
        "spark.kryo.referenceTracking": "false",
        "spark.kryoserializer.buffer": "300m",
        "spark.executor.memory": "20g",
        "spark.driver.memory": "10g",
        "spark.es.index.auto.create": "true",
        "spark.es.nodes": "elasticsearch-host",
        "spark.es.nodes.wan.only": "true"
    },
    "algorithm":{
        "indicators": [ 
            {
                "name": "buy"
            },{
                "name": "view"
            }
        ]
    }
}

Here we are telling Harness how to create a UR instance and telling the UR instance what types of input to expect. NOTE: the first indicator is the primary one. When it comes in as an input Event, it carries the item-ids that will be recommended. The secondary indicator also comes in as an input Event and makes the UR more predictive because it gives more information about user preferences. Secondary indicators do not have to use the same item-ids as the primary, so it may be easier to send a page-id with "view" Events than the product-id sent with "buy" Events. The secondary indicator will be just as helpful.
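To make the primary/secondary distinction concrete, here is a small Python sketch that reads the indicator list from a config like the one above and builds one input Event per indicator. The Event field names (`event`, `entityType`, `entityId`, `targetEntityType`, `targetEntityId`) are an assumption about the Harness input format, which this page does not spell out, and the ids and helper names are made up for illustration.

```python
import json

config_text = """
{
    "engineId": "ecom_ur",
    "algorithm": {
        "indicators": [
            {"name": "buy"},
            {"name": "view"}
        ]
    }
}
"""


def split_indicators(config: dict) -> tuple[str, list[str]]:
    """First indicator is primary; all remaining ones are secondary."""
    names = [ind["name"] for ind in config["algorithm"]["indicators"]]
    if not names:
        raise ValueError("at least one indicator is required")
    return names[0], names[1:]


def make_event(indicator: str, user_id: str, target_id: str) -> dict:
    # Field names are assumed for illustration, see the note above.
    return {
        "event": indicator,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": target_id,
    }


config = json.loads(config_text)
primary, secondary = split_indicators(config)

# The primary event carries the item-id that can be recommended.
buy_event = make_event(primary, "user-1", "product-42")
# A secondary event may use a different id space, such as a page-id.
view_event = make_event(secondary[0], "user-1", "page-7")

print(json.dumps([buy_event, view_event], indent=2))
```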

Depending on the size of your data, this config might work just fine for an ECom application. If the dataset grows too large, increase the memory given to Spark.

It is highly recommended that you start with this type of config before tuning the numerous values that may (or may not) yield better results.

Complete UR Engine Configuration Specification

How to read config settings:

{
    "engineId": "<some-unique-id>",
    "engineFactory": "com.actionml.engines.ur.UREngine",
    "modelContainer": "</some/model/path>",
    "mirrorType": "localfs" | "hdfs",
    "mirrorLocation": "</some/mirror/path>",
    "dataset": {
        "ttl": "<356 days>"
    },
    "sparkConf": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
        "spark.kryo.referenceTracking": "false",
        "spark.kryoserializer.buffer": "300m",
        "spark.executor.memory": "<4g>",
        "spark.es.index.auto.create": "true",
        "spark.es.nodes": "<node1>,<node2>",
        "spark.es.nodes.wan.only": "true"
    },
    "algorithm":{
        "indicators": [ 
            {
                "name": "<indicator-event-name>",
                "maxCorrelatorsPerItem": <some-int>,
                "minLLR": <some-int>,
                "maxIndicatorsPerQuery": <some-int>
            },
            ...
        ],
        "blacklistIndicators": ["<list>", "<of>", "<indicator>", "<names>"],
        "maxEventsPerEventType": <some-int>,
        "maxCorrelatorsPerEventType": <some-int>,
        "maxQueryEvents": <some-int>,
        "num": <some-number-of-results-to-return>,
        "seed": <some-int>,
        "recsModel": "all" | "collabFiltering" | "backfill",
        "expireDateName": "<some-expire-date-property-name>",
        "availableDateName": "<some-available-date-property-name>",
        "dateName": "<dateFieldName>",
        "userbias": <-maxFloat..maxFloat>,
        "itembias": <-maxFloat..maxFloat>,
        "returnSelf": true | false,
        "rankings": [
          {
            "name": "<some-field-name>",
            "type": "popular" | "trending" | "hot",
            "indicatorNames":
                ["<some-indicator-1>", "<some-indicator-2>", ...],
            "duration": "<365 days>"
          } // ONLY ONE SUPPORTED
        ],
        "rules": [
          {
            "name": "<some-property-name>",
            "values": ["value1", ...],
            "bias": <-maxFloat..maxFloat>
          },
          ...
        ],
        "numESWriteConnections": 100
    }
}

Default UR Settings

{
    "engineId": REQUIRED,
    "engineFactory": "com.actionml.engines.ur.UREngine",
    "modelContainer": NONE,
    "mirrorType": NONE,
    "mirrorLocation": NONE,
    "dataset": { // OPTIONAL
        "ttl": "356 days"
    },
    "sparkConf": {
        "master": REQUIRED,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
        "spark.kryo.referenceTracking": "false",
        "spark.kryoserializer.buffer": "300m",
        "spark.executor.memory": REQUIRED,
        "spark.driver.memory": REQUIRED,
        "spark.es.index.auto.create": "true",
        "spark.es.nodes": "localhost",
        "spark.es.nodes.wan.only": "true"
    },
    "algorithm":{
        "indicators": [ 
            {
                "name": ONE OR MORE REQUIRED,
                "maxCorrelatorsPerItem": 50,
                "minLLR": NONE,
                "maxIndicatorsPerQuery": 100
            },
            ...
        ],
        "blacklistIndicators": ["primary-indicator-name"], // OPTIONAL
        "maxEventsPerEventType": 500, // OPTIONAL
        "maxCorrelatorsPerEventType": 50, // OPTIONAL
        "maxQueryEvents": 100, // OPTIONAL
        "num": 20, // OPTIONAL
        "seed": RANDOM,
        "recsModel": "all", // OPTIONAL
        "rankings": [
          { // OPTIONAL
            "name": "popRank", // OPTIONAL
            "type": "popular", // OPTIONAL
            "indicatorNames": ["primary-indicator"], // OPTIONAL
            "duration": "xxx days" // OPTIONAL
          } // ONLY ONE SUPPORTED
        ],
        "expireDateName": NONE,
        "availableDateName": NONE,
        "dateName": NONE,
        "userbias": NONE,
        "itembias": NONE,
        "returnSelf": false, // OPTIONAL
        "rules": [ NONE ], // OPTIONAL
        "numESWriteConnections": NONE
    }
}
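As a rough illustration of how the table above can be read, the following Python sketch checks a config for the settings marked REQUIRED and fills in a subset of the documented defaults. It is not how Harness itself processes a config. The helper names (`missing_required`, `with_defaults`) and the sample values such as "master": "local" are assumptions made for the example.

```python
import copy

# Optional settings and their documented defaults (subset of the table above).
ALGORITHM_DEFAULTS = {
    "maxEventsPerEventType": 500,
    "maxCorrelatorsPerEventType": 50,
    "maxQueryEvents": 100,
    "num": 20,
    "recsModel": "all",
    "returnSelf": False,
}
INDICATOR_DEFAULTS = {"maxCorrelatorsPerItem": 50, "maxIndicatorsPerQuery": 100}
REQUIRED_SPARK_KEYS = ("master", "spark.executor.memory", "spark.driver.memory")


def missing_required(config: dict) -> list[str]:
    """List the settings marked REQUIRED that the config does not provide."""
    missing = []
    if "engineId" not in config:
        missing.append("engineId")
    spark = config.get("sparkConf", {})
    missing += [f"sparkConf.{k}" for k in REQUIRED_SPARK_KEYS if k not in spark]
    indicators = config.get("algorithm", {}).get("indicators", [])
    if not any("name" in ind for ind in indicators):
        missing.append("algorithm.indicators[].name")
    return missing


def with_defaults(config: dict) -> dict:
    """Return a copy of the config with documented defaults filled in."""
    result = copy.deepcopy(config)
    algorithm = result.setdefault("algorithm", {})
    for key, value in ALGORITHM_DEFAULTS.items():
        algorithm.setdefault(key, value)
    for indicator in algorithm.get("indicators", []):
        for key, value in INDICATOR_DEFAULTS.items():
            indicator.setdefault(key, value)
    return result


minimal = {
    "engineId": "ecom_ur",
    "sparkConf": {
        "master": "local",  # sample value, chosen for the example
        "spark.executor.memory": "4g",
        "spark.driver.memory": "4g",
    },
    "algorithm": {"indicators": [{"name": "buy"}, {"name": "view"}], "num": 10},
}

print(missing_required(minimal))
print(with_defaults(minimal)["algorithm"])
```

Values already present in the config, such as "num": 10 here, are left untouched; only missing keys receive a default.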

Spark Parameters (sparkConf)

The UR uses Spark to update its model. This happens when you execute harness-cli train <ur-engine-id>. This means Spark must be configured with job settings. These must be tuned to fit the dataset and allow Spark jobs to write the model to Elasticsearch. In some cases they configure libraries like Mahout or the Elasticsearch Spark client.

These settings are in addition to Harness's settings in harness-env and may duplicate them.

The meaning of these params can be found in the Spark docs or the docs of the various libraries used by the UR. You may need to consult those docs if the above explanation is not sufficient. Spark has other parameters that may aid in running the UR training task, but these are best left for expert usage; the above will be sufficient most of the time. See also the ActionML Spark Intro.

Dataset Parameters

The "dataset" section controls how long to keep input data.
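The effect of a "ttl" can be pictured with a short Python sketch. This page does not describe how Harness actually expires data, so the code below is only a conceptual illustration: it assumes the value has the form "<number> days", as in the configs above, and the helper names `parse_ttl` and `is_expired` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ttl(ttl: str) -> timedelta:
    """Parse a '<number> days' string; other units are not handled here."""
    amount, unit = ttl.split()
    if unit not in ("day", "days"):
        raise ValueError(f"unsupported ttl unit: {unit}")
    return timedelta(days=int(amount))


def is_expired(event_time: datetime, ttl: str, now: datetime) -> bool:
    """True when the event is older than the ttl relative to 'now'."""
    return now - event_time > parse_ttl(ttl)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
old_event = now - timedelta(days=400)
recent_event = now - timedelta(days=30)

print(is_expired(old_event, "356 days", now))
print(is_expired(recent_event, "356 days", now))
```

Under these assumptions, an event 400 days old falls outside a "356 days" ttl, while one 30 days old is kept.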

Algorithm Parameters

The "algorithm" section controls most of the tuning and config of the UR. Possible values are: