Universal Recommender Tuning

The UR's default settings work well for many purposes, but getting optimum results may require tuning, and at the very least many users will want to know what the various tuning parameters mean.

UR Parameters v0.6.0

These instructions are for the latest Universal Recommender v0.6.0, which requires you to build Mahout and have Apache PredictionIO installed (not the ActionML branch). See installation instructions for PIO and the UR.

Start Here: Find Your Primary Conversion Indicator

To start using a recommender you must have a primary indicator of user preference. This is sometimes called the "conversion" event. If it is not obvious, ask yourself two questions:

  1. "What item type do I want to recommend?" For ecom this will be a product; for media it will be a story or video.
  2. "Of all the data I have, what is the most pure and unambiguous indication of user preference for this item?" For ecom a "buy" is good; for media a "read" or "watch90" (for watching 90% of a video). Try to avoid ambiguous things like ratings. After all, what does a rating of 3 mean on a 1-5 range? Does a rating of 2 mean a like or a dislike? If you have ratings, then 4-5 is a pretty unambiguous "like", so the other ratings may not apply to a primary indicator, though they may still be useful, so read on.

Take the item from #1, the indicator-name from #2 and the user-id and you have the data to create a "primary indicator" of the form (user-id, "indicator-name", item-id).
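As a minimal sketch of that triple, the Python below builds an event dictionary from (user-id, "indicator-name", item-id) and folds a 1-5 rating into a "like" indicator. The helper names are hypothetical, and the field names (`event`, `entityType`, `entityId`, `targetEntityType`, `targetEntityId`, `eventTime`) are assumed to follow the PredictionIO event format; check them against your installed version.

```python
from datetime import datetime, timezone


def primary_indicator_event(user_id, indicator_name, item_id, when=None):
    """Build one (user-id, "indicator-name", item-id) event as a dict.

    Field names are assumed to match the PredictionIO event format.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "event": indicator_name,       # e.g. "buy", "read", "watch90"
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "eventTime": when.isoformat(),
    }


def rating_to_indicator(rating):
    """Map a 1-5 rating to a primary indicator name, or None if ambiguous.

    Only 4-5 is treated as an unambiguous "like", as suggested above.
    """
    return "like" if rating >= 4 else None


event = primary_indicator_event("user-1", "buy", "ipad")
print(event["event"], event["entityId"], event["targetEntityId"])
```

Ratings of 1-3 return `None` here so that they are kept out of the primary indicator; they could still be recorded under a separate, secondary indicator name.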

Secondary Indicators

There must be a "primary indicator" recorded for some number of users. This defines the type of item returned in recommendations and is the thing by which all secondary data is measured. More technically speaking, all secondary data is tested for correlation to the primary indicator. Secondary data can be anything that you think gives some insight into the user's taste. If something in the secondary data has no correlation to the primary indicator, it will have no effect on recommendations. For instance, in an ecom setting you may want "buy" as the primary event. There may be many secondary events (none is also fine), such as (user-id, device-preference, device-id). This can be thought of as a user's device preference, recorded at every login. If this does not correlate to items bought, it will not affect recommendations.
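To make "tested for correlation" concrete, here is a minimal sketch of a log-likelihood ratio (LLR) score over a 2x2 table of user counts. It assumes the correlation test is the LLR that the `minLLR` setting in the full parameter list refers to; the function names and the counts are hypothetical, and this is not the UR's own code.

```python
import math


def _x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy of a list of counts
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users with both the secondary and the primary indicator
    k12: users with the secondary indicator only
    k21: users with the primary indicator only
    k22: users with neither
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts: a strongly correlated pair scores far higher
# than a weakly correlated one, and independent data scores about 0.
print(llr(10, 0, 0, 10), llr(6, 4, 4, 6), llr(10, 10, 10, 10))
```

Under this assumption, a secondary indicator whose score stays near zero contributes nothing, which matches the statement above that uncorrelated secondary data has no effect on recommendations.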

Biases

Biases in query fields can be used to blend content-based results with collaborative filtering results. They can also be used to implement business rules.

Biases take the form of boosts and of inclusion or exclusion filters, where a neutral bias is 1.0. The importance of some part of the query may be boosted by a positive non-zero float. If the bias is < 0 it is treated as a filter, meaning no recommendation is made that lacks the filter value(s).

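Think of a bias as a multiplier on the score of the items that meet the condition: if bias = 2 and item-1 meets the condition, item-1's score is multiplied by 2. After all biases are applied, the recommendations are returned ranked by score. The effect of a bias therefore depends on its value: above 1.0 it raises matching items in the ranking, at 1.0 it changes nothing, between 0 and 1.0 it lowers them, and below 0 it acts as a filter.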

One example of a filter is showing only "electronics" recommendations when the user is viewing an electronics product. Biases are often applied to a list of data, for instance when the user is looking at a video page with a cast of actors. The "cast" list is metadata attached to items, and a query can show "people who liked this, also liked these" type recommendations but also include the current cast boosted by 1.01. This can be seen as showing similar-item recommendations but using the cast members to gently boost the similar items (since by default they have a neutral 1.0 boost). The result would be similar items, favoring ones with similar cast members.
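The multiplier-and-filter behavior described in this section can be sketched in a few lines of Python. This is an illustration only, not the UR's scoring code: `apply_field_biases`, the item ids, and the property values are hypothetical, and the field dictionaries simply mirror the name/values/bias shape of the `fields` parameter.

```python
def apply_field_biases(scores, item_props, fields):
    """Apply boosts and filters to raw scores, then rank.

    scores:     item-id -> raw score
    item_props: item-id -> {field name -> set of values}
    fields:     list of {"name": ..., "values": [...], "bias": ...}
    """
    result = dict(scores)
    for field in fields:
        wanted = set(field["values"])
        bias = field["bias"]
        for item in list(result):
            have = item_props.get(item, {}).get(field["name"], set())
            matches = bool(wanted & have)
            if bias < 0:
                # Negative bias is a filter: drop items lacking the value(s)
                if not matches:
                    del result[item]
            elif bias > 0 and matches:
                # Positive bias multiplies the score of matching items
                result[item] *= bias
            # A bias of exactly 0 is not defined by the text above,
            # so this sketch leaves scores untouched in that case.
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)


scores = {"video-1": 1.0, "video-2": 1.0, "video-3": 0.5}
props = {
    "video-1": {"cast": {"actor-a"}, "category": {"drama"}},
    "video-2": {"cast": {"actor-b"}, "category": {"drama"}},
    "video-3": {"cast": {"actor-a"}, "category": {"comedy"}},
}
boost = [{"name": "cast", "values": ["actor-a"], "bias": 1.01}]
print(apply_field_biases(scores, props, boost))
```

With a bias of 1.01 the shared cast member only breaks ties between otherwise equal items, which is the "gentle boost" described above; a bias of -1 on `category` would instead remove every item outside the named categories.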

Dates

Dates can be used to specify the recommended items in one of 2 ways that should never be used together:

Dates are only used for filters but apply in all recommendation modes including all of the possible rankings. See Date Range Filters for details.

Engine.json

This file allows the user to describe and set parameters that control the engine operations. Many values have defaults so the following can be seen as the minimum for an ecom app with only one "buy" event. Reasonable defaults are used so try this first and add tunings or new event types and item property fields as you become more familiar.

Simple Default Values

{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "datasource-name",
      "appName": "handmade",
      "eventNames": ["purchase", "view"]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "4g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventNames",
      "name": "ur",
      "params": {
        "appName": "handmade",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["purchase", "view"]
      }
    }
  ]
}
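Before training, it can help to check that the few required values are actually present. The following is a hedged sketch, not part of the UR or PredictionIO: `missing_required` is a hypothetical helper that only checks the algorithm parameters this page names as required (`eventNames`, `indexName`, `typeName`) and ignores everything else.

```python
import json


def missing_required(engine_config):
    """Return a list of problems found in the algorithm params."""
    problems = []
    for algo in engine_config.get("algorithms", []):
        name = algo.get("name", "?")
        params = algo.get("params", {})
        # The Elasticsearch index location must be given explicitly
        for key in ("indexName", "typeName"):
            if not params.get(key):
                problems.append(f"{name}: {key} is not set")
        # The first event name is the primary indicator, so one is required
        if not params.get("eventNames"):
            problems.append(f"{name}: eventNames needs at least one event name")
    return problems


config = json.loads("""
{
  "algorithms": [
    {"name": "ur",
     "params": {"appName": "handmade", "indexName": "urindex",
                "typeName": "items", "eventNames": []}}
  ]
}
""")
print(missing_required(config))
```

To check a real file, load it with `json.load(open("engine.json"))` instead of the inline string. The sketch assumes the file is strict JSON without inline comments.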

Full Parameters

A full list of tuning and config parameters is below. See the field descriptions for specific meanings. Some of the parameters act as default values for every query and can be overridden or added to in the query; other parameters control model building.

Note: It is strongly advised that you try the default/simple settings first before changing them. The exception is that at least one event name must be put in the eventNames array and the path to the Elasticsearch index must be specified in indexName and typeName.

{
  "id": "default",
  "description": "Default settings",
  "comment": "replace this with your JVM package prefix, like org.apache",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "some-data",
      "appName": "URApp1",
      "eventNames": ["buy", "view"],
      "eventWindow": {
        "duration": "3650 days",
        "removeDuplicates": false,
        "compressProperties": false
      }
    }
  },
  "comment": "This is for Mahout and Elasticsearch, the values are minimums and should not be removed",
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "4g",
    "es.index.auto.create": "true",
    "es.nodes": "node1,node2"
  },
  "algorithms": [
    {
      "name": "ur",
      "params": {
        "appName": "app1",
        "indexName": "urindex",
        "typeName": "items",
        "numESWriteConnections": 100, // number of simultaneous connections used when writing to ES
        "eventNames": ["buy", "view"],
        "indicators": [
            {
                "name": "purchase"
            },{
                "name": "view",
                "maxCorrelatorsPerItem": 50,
                "minLLR": 5
            }
        ],
        "blacklistEvents": ["buy", "view"],
        "maxEventsPerEventType": 500,
        "maxCorrelatorsPerEventType": 50,
        "maxQueryEvents": 100,
        "num": 20,
        "seed": 3,
        "recsModel": "all",
        "rankings": [
          {
            "name": "popRank",
            "type": "popular", // or "trending" or "hot"
            "eventNames": ["buy", "view"],
            "duration": "3 days",
            "endDate": "ISO8601-date" // most recent date to end the duration
          },
          {
            "name": "uniqueRank",
            "type": "random"
          },
          {
            "name": "preferredRank",
            "type": "userDefined"
          }
        ],
        "expireDateName": "expireDateFieldName",
        "availableDateName": "availableDateFieldName",
        "dateName": "dateFieldName",
        "userbias": -maxFloat..maxFloat,
        "itembias": -maxFloat..maxFloat,
        "returnSelf": true | false,
        "fields": [
          {
            "name": "fieldname",
            "values": ["fieldValue1", ...],
            "bias": -maxFloat..maxFloat
          },...
        ]
      }
    }
  ]
}

Datasource Parameters

The datasource: params: section controls input data. This section is algorithm independent and is meant to manage the size of data in the EventServer and do compaction. It changes the persisted state of data. A fixed eventWindow: duration: will have the effect of making the UR calculate a model in a fixed amount of time, as soon as there are enough events to start dropping old ones.
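The effect of a fixed event window can be illustrated with a small Python sketch. It is an assumption-laden model of the behavior, not the EventServer's implementation: `trim_events` is a hypothetical helper, events are plain dictionaries with a timezone-aware `eventTime`, and a "duplicate" is assumed to mean the same user, event name, and item.

```python
from datetime import datetime, timedelta, timezone


def trim_events(events, duration_days, now, remove_duplicates=False):
    """Keep only events inside the window ending at `now`."""
    cutoff = now - timedelta(days=duration_days)
    kept = [e for e in events if e["eventTime"] >= cutoff]
    if remove_duplicates:
        seen = set()
        unique = []
        for e in kept:
            # Assumed duplicate key: same user, event name, and item
            key = (e["entityId"], e["event"], e["targetEntityId"])
            if key not in seen:
                seen.add(key)
                unique.append(e)
        kept = unique
    return kept


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
events = [
    {"entityId": "u1", "event": "buy", "targetEntityId": "i1",
     "eventTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"entityId": "u1", "event": "buy", "targetEntityId": "i2",
     "eventTime": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"entityId": "u1", "event": "buy", "targetEntityId": "i2",
     "eventTime": datetime(2024, 1, 9, 12, tzinfo=timezone.utc)},
]
print(len(trim_events(events, 3, now)))
print(len(trim_events(events, 3, now, remove_duplicates=True)))
```

Once the stored history is longer than the window, the number of events fed to training stops growing, which is why the time to calculate a model levels off.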

Spark Parameters

For the most part these are fixed. The exceptions are the Elasticsearch params that start with "es." These are documented on the Elasticsearch site. The common ones you might need are:

Algorithm Parameters

The Algorithm: params: section controls most of the features of the UR. Possible values are: