The Db-cleaner Template

Big data is one thing; infinite data is another, and one to avoid unless you have infinite resources. PredictionIO provides methods you can add to your template to trim old events from your EventStore, but by default these are not active in most templates. To make your template "self-cleaning", see the discussion below.

ActionML also maintains a template, for use alongside any other template, that can be scheduled to periodically trim and compress the data of any EventStore app. See the db-cleaner template here.

Download the template as you would any other and go through the PIO workflow of pio build and pio train. pio deploy is not necessary, since the cleaning is done during pio train.

An example simple engine.json is:

{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.templates.dbcleaner.DBCleaner",
  "datasource": {
    "params" : {
      "appName": "test-app",
      "eventWindow": {
        "duration": "4 days",
        "removeDuplicates": true,
        "compressProperties": true
      }
    }
  },
  "algorithms": [
    {
      "name": "db-cleaner-algo",
      "params": {
        "appName": "test-app"
      }
    }
  ]
}

The minimal config is in engine.json.template and does not include the optional de-duping and property compression:

{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.templates.dbcleaner.DBCleaner",
  "datasource": {
    "params" : {
      "appName": "test-app",
      "eventWindow": {
        "duration": "4 days"
      }
    }
  },
  "algorithms": [
    {
      "name": "db-cleaner-algo",
      "params": {
        "appName": "test-app"
      }
    }
  ]
}
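To make the "duration" setting concrete, here is a minimal sketch of how a window such as "4 days" could be turned into a cutoff time. It assumes the string is in a form that Scala's standard scala.concurrent.duration.Duration can parse; whether the template parses it exactly this way is an assumption, and WindowSketch and cutoffMillis are hypothetical names used only for illustration.

```scala
import scala.concurrent.duration.Duration

// Hypothetical helper, not part of the db-cleaner template.
object WindowSketch {
  // Parse a window such as "4 days" and return the cutoff in epoch milliseconds.
  // Events with an event time before the cutoff fall outside the window.
  def cutoffMillis(window: String, nowMillis: Long): Long =
    nowMillis - Duration(window).toMillis
}
```

With a window of "4 days", the cutoff lies 345,600,000 milliseconds before the supplied "now".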

Add This Feature to Your Template

The SelfCleaningDataSource allows any template to specify an age for events. When events get too old they are removed permanently from the EventServer. It also allows a template to de-duplicate events, and to compact $set/$unset property change events.
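The three operations can be illustrated on an in-memory list of events. This is a simplified sketch, not the SelfCleaningDataSource implementation: the Ev record and the function names are hypothetical, duplicates are assumed to be events whose fields are all equal, and compaction is assumed to fold each entity's $set/$unset history into a single $set carrying the latest event time.

```scala
import java.time.{Duration, Instant}

// Hypothetical, simplified model; the real PredictionIO event has more fields.
object CleaningSketch {
  final case class Ev(
    event: String,
    entityId: String,
    properties: Map[String, String],
    eventTime: Instant
  )

  // 1. Age out: drop events older than (now - window).
  def trimOld(events: Seq[Ev], window: Duration, now: Instant): Seq[Ev] = {
    val cutoff = now.minus(window)
    events.filterNot(_.eventTime.isBefore(cutoff))
  }

  // 2. De-duplicate: keep one copy of events that are equal in every field
  //    (assumption: the real duplicate criterion may differ).
  def dedupe(events: Seq[Ev]): Seq[Ev] = events.distinct

  // 3. Compact property changes: replay each entity's $set/$unset events in
  //    time order and emit a single $set holding the resulting properties.
  def compress(events: Seq[Ev]): Seq[Ev] = {
    val (propEvents, others) =
      events.partition(e => e.event == "$set" || e.event == "$unset")
    val compacted = propEvents.groupBy(_.entityId).toSeq.map { case (id, evs) =>
      val ordered = evs.sortBy(_.eventTime.toEpochMilli)
      val props = ordered.foldLeft(Map.empty[String, String]) { (acc, e) =>
        if (e.event == "$set") acc ++ e.properties else acc -- e.properties.keys
      }
      Ev("$set", id, props, ordered.last.eventTime)
    }
    others ++ compacted
  }
}
```

In this sketch a $set of two properties followed by an $unset of one of them collapses into a single $set with the remaining property, which is the kind of saving that property compaction aims for.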

Adding the SelfCleaningDataSource to a template takes only a very simple code change, and it has already been added to the Universal Recommender template. To add this feature to any template, simply mix SelfCleaningDataSource into your DataSource, as is done in the UR here.

Template Code Change

Find the DataSource class in your template code and add the with clause, the logger, appName and eventWindow like this:

class DataSource(val dsp: DataSourceParams)
  extends PDataSource[TrainingData, EmptyEvaluationInfo, Query, EmptyActualResult]
  //======= copy from here ===========
  with SelfCleaningDataSource {

  @transient override lazy val logger = Logger[this.type]

  override def appName = dsp.appName
  override def eventWindow = dsp.eventWindow
  //======= to here ===========
  
  ...
}

To use the newly extended DataSource, add the parameters described below to engine.json and make this call:

def readTraining(sc: SparkContext): TrainingData = {
    // add this line to clean PEvents
    cleanPersistedPEvents(sc)

before you access PEvents. Note: aged-out events are removed permanently, so keep a backup if you are experimenting.

Parameters

Then configure the DataSource operation in engine.json as follows:

"datasource": {
  "params" : {
    "name": "some-name",
    "appName": "some-app-name",
    "eventNames": ["purchase", "view"],
    "eventWindow": {
      "duration": "3650 days",
      "removeDuplicates": false,
      "compressProperties": false
    }
  }
}
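For orientation, the JSON above could map onto parameter classes along these lines. This is a hedged sketch: the class and field names mirror the JSON keys and the dsp.appName / dsp.eventWindow accessors used earlier, but the exact types and defaults are assumptions, and the classes are defined locally here rather than imported from PredictionIO.

```scala
// Hypothetical stand-ins for the template's parameter classes.
object ParamsSketch {
  // Optional cleaning settings; the two flags are assumed to default to off.
  final case class EventWindow(
    duration: Option[String] = None,
    removeDuplicates: Boolean = false,
    compressProperties: Boolean = false
  )

  // Mirrors the "params" object under "datasource" in engine.json.
  final case class DataSourceParams(
    name: String,
    appName: String,
    eventNames: List[String],
    eventWindow: Option[EventWindow] = None
  )

  // The values from the example configuration above.
  val example: DataSourceParams = DataSourceParams(
    name = "some-name",
    appName = "some-app-name",
    eventNames = List("purchase", "view"),
    eventWindow = Some(EventWindow(duration = Some("3650 days")))
  )
}
```

Modelling eventWindow as an Option reflects the idea that a template with no window configured would skip cleaning altogether; that behavior is likewise an assumption of the sketch.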