Scaling Machine Learning Models With Docker, ECS and Redis

In PubPlus’ Data Science Department, we strive to make life easier with automated tools. That’s why we were enthusiastic about designing and building an infrastructure that could handle both rapid growth in machine learning model training and the production of predictions for our key metrics of interest.

The Objectives

Running machine learning models and data processing jobs is time-consuming and demanding. We therefore needed an approach that lets us parallelize computations as much as possible while keeping our data stable as the system scales.

The required architecture should allow us to quickly test, evaluate, and compare machine learning models and data processing components. In addition, no computational task should overwrite existing output data; this allows us to retroactively run a specific version of the code on a given input to inspect bugs and detect performance bottlenecks.

The Building Blocks

After some research and brainstorming sessions, we’ve chosen the following abstracted entities to be the building blocks of our selected infrastructure:

Task: a code module that performs a specific computation on a given input and produces an output
Pipe: a data storage unit representing an input or an output to tasks
Flow: a directed graph where the nodes are tasks and the edges are pipes. The flow defines the list of dependencies for each task; a task waits until all of its dependencies are finished or a timeout is reached
State: a list of all the active entities in the system at a given time
Wave: an execution of a flow over a state

Task

Tasks are code modules that are executed in a containerized environment using AWS Elastic Container Service (ECS). We selected Docker containers as our standardized unit of software because of their core benefits:

Lightweight – Docker containers share the host machine’s OS kernel, resulting in smaller images compared to virtual machines
Portable – an image built from a Git branch can run on any worker
Flexible – write independent tasks as microservices in any programming language
Secure – each container environment is isolated from other containers

Each task implements a certain computation or machine learning model independently of how other tasks are implemented. This includes the choice of programming language, although Python or Go is recommended.

Running experiments to evaluate models or code updates becomes as easy as deploying a container cluster for each Git branch and inspecting the results in a Jupyter Notebook.

Function Task(task):
	# Stay pending until every dependency is finished or the timeout is reached
	while not all(task.dependencies.status == Status.Finished) \
			and Runtime() < task.timeout:
		Sleep(1)

	UpdateStatus(task, Status.Running)

	# Read the input pipes, run the computation, and write the output pipe
	config <- GetTaskConfig(task)
	pipes_data <- GetPipeData(task, config)
	processed_data <- ProcessData(pipes_data)

	PutPipeData(task, config, processed_data)

	UpdateStatus(task, Status.Finished)

Pseudo code of a task
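
The pseudocode above could translate to Python roughly as follows. This is a minimal sketch under stated assumptions, not our production code: the in-memory `STATUS` and `PIPES` dictionaries stand in for the status store and Redis, and `run_task`, `process`, and the polling interval are hypothetical names chosen for illustration.

```python
import time
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


# Hypothetical in-memory stand-ins for the status store and the Redis pipes.
STATUS = {}
PIPES = {}


def run_task(name, dependencies, process, timeout=5.0, poll=0.01):
    """Wait for dependencies (or the timeout), then compute and publish a pipe."""
    STATUS[name] = Status.PENDING
    deadline = time.monotonic() + timeout
    # Stay pending until every dependency is finished or the timeout is reached.
    while (
        any(STATUS.get(dep) is not Status.FINISHED for dep in dependencies)
        and time.monotonic() < deadline
    ):
        time.sleep(poll)

    STATUS[name] = Status.RUNNING
    # Read whatever input pipes exist; a dependency that never finished is skipped.
    inputs = {dep: PIPES[dep] for dep in dependencies if dep in PIPES}
    PIPES[name] = process(inputs)
    STATUS[name] = Status.FINISHED
    return PIPES[name]


# Usage: a collection task with no dependencies, then a task that consumes it.
run_task("collect", [], lambda inputs: [1, 2, 3])
total = run_task("sum", ["collect"], lambda inputs: sum(inputs["collect"]))
```

Note that a task whose dependencies never finish still runs once the timeout passes, using only the input pipes that are available at that point.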

Pipe

The pipes are used as the exclusive input and output data units for each task. We chose to store them on Redis in the following key pattern:

“{version}/{pipe_name} {entity_key} {date}”

Let’s take a deeper look at each key component:

Version: the version of the task module that yields the pipe data as an output. Version prefixes allow us to run different versions of the code simultaneously and compare them later for evaluation
Pipe name: the name of the pipe corresponds to the name of the task that created it
Entity key: the unique identifier of the base entity in which the flow is executed
Date: the date the action is related to; it is added for partitioning and for adjusting the time to live (TTL)
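
To make the key pattern concrete, here is a small Python sketch that builds keys with the separators exactly as printed above. The function names, the ISO date format, the seven-day retention, and the example values (`master`, `feature-x`, `campaign_42`) are all assumptions for illustration.

```python
from datetime import date, timedelta


def pipe_key(version, pipe_name, entity_key, day):
    """Build a pipe key following the pattern shown above."""
    # The ISO date format is an assumption; the pattern only says {date}.
    return f"{version}/{pipe_name} {entity_key} {day.isoformat()}"


def ttl_seconds(day, today, keep_days=7):
    """Seconds left before a pipe dated `day` expires (hypothetical retention)."""
    expires = day + timedelta(days=keep_days)
    return max((expires - today).days, 0) * 86400


# Two versions of the same task write to different keys, so neither overwrites the other.
key_a = pipe_key("master", "pages_per_visit", "campaign_42", date(2019, 1, 15))
key_b = pipe_key("feature-x", "pages_per_visit", "campaign_42", date(2019, 1, 15))
ttl = ttl_seconds(date(2019, 1, 15), today=date(2019, 1, 16))
```

With a Redis client, the TTL could then be applied when writing the value, for example through the `EX` option of the `SET` command.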

We chose Redis as our data store because it can cost-effectively handle thousands of operations per second with low latency.

Listing a ‘pages per visit’ pipe in Redis CLI

Flow

Our system needs to collect data across multiple third-party sources via APIs. It also needs to process that data and use it for machine learning model training. There are two types of tasks: data collection tasks and data computation tasks.

Flow example

The data collection tasks are triggered at once. The other tasks are executed only after their dependent tasks are completed.
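
One way to picture this ordering is to treat the flow as a dependency graph and group the tasks into stages. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the task names and the `execution_stages` helper are hypothetical, not taken from our flow.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each task maps to the set of tasks it depends on.
flow = {
    "collect_traffic": set(),
    "collect_revenue": set(),
    "build_features": {"collect_traffic", "collect_revenue"},
    "train_model": {"build_features"},
    "predict": {"train_model"},
}


def execution_stages(flow):
    """Group tasks into stages; tasks within a stage can run in parallel."""
    sorter = TopologicalSorter(flow)
    sorter.prepare()  # raises graphlib.CycleError if the flow has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(flow)
```

The first stage contains only the data collection tasks, which have no dependencies and can therefore start together; every later stage becomes ready only once the stage it depends on is done.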

State

A state consists of the set of entities that are active at a given time. The data collection tasks constantly update the state. The state is kept in Redis and is partitioned by a common feature. For example, a state could be defined as the set of all currently active campaigns.

Wave

A wave is an execution of a flow over a state. Each time we want to run a wave, we trigger a scheduled init task that gets the latest state and a given flow. For each entity in the state, the init task compiles an executable bash script that reflects the order of the tasks and their dependencies as defined by the flow.
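
The compile step of a wave might look roughly like the sketch below. It is only an illustration: the `run_task` command and its flags are placeholders rather than a real CLI, and the state, flow, and version values are made up.

```python
from graphlib import TopologicalSorter


def compile_script(entity_key, flow, version):
    """Render a bash script that lists the flow's tasks in dependency order."""
    lines = ["#!/usr/bin/env bash"]
    for task in TopologicalSorter(flow).static_order():
        deps = ",".join(sorted(flow.get(task, ()))) or "none"
        # `run_task` and its flags are placeholders, not a real CLI.
        lines.append(
            f"run_task --version {version} --task {task} "
            f"--entity {entity_key} --depends-on {deps}"
        )
    return "\n".join(lines) + "\n"


def compile_wave(state, flow, version):
    """Compile one script per entity in the state."""
    return {entity: compile_script(entity, flow, version) for entity in state}


state = ["campaign_1", "campaign_2"]  # e.g. the currently active campaigns
flow = {"collect": set(), "predict": {"collect"}}
wave = compile_wave(state, flow, version="master")
```

Because every entity gets its own script, entities can be processed independently of one another, which is what lets a wave spread across many workers.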

Making Machine Learning Tasks Fault Tolerant

During execution, there might be networking issues, dying workers, and/or bugs. As a result, some tasks may fail. We want to avoid sophisticated retry mechanisms in order to keep the architecture as simple as possible while maintaining stable outputs.

To achieve this goal, each machine learning task forecasts predictions for several coming hours instead of predicting only the current hour. This behavior makes the system robust to external API failures and internal cluster issues. The tradeoff is lower confidence in the predictions for the later hours when they aren’t replaced by a fresh wave.
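
The fallback logic can be sketched in a few lines of Python. The data layout here (a dictionary of waves, each mapping target hours to predicted values) and the `latest_prediction` helper are assumptions made for illustration; the numbers are invented.

```python
def latest_prediction(predictions, target_hour):
    """Return (wave_hour, value) from the most recent wave covering target_hour."""
    candidates = [
        (wave_hour, horizons[target_hour])
        for wave_hour, horizons in predictions.items()
        if target_hour in horizons
    ]
    if not candidates:
        return None
    # Prefer the freshest wave: its forecast has the shortest horizon.
    return max(candidates, key=lambda item: item[0])


# Hypothetical data: each wave forecasts its own hour and the two hours after it.
predictions = {
    10: {10: 5.0, 11: 5.5, 12: 6.1},
    11: {11: 5.4, 12: 6.0, 13: 6.3},
    # The wave at hour 12 failed, so it wrote nothing.
}

fallback = latest_prediction(predictions, 12)
```

Here the failed wave at hour 12 is covered by the forecast made at hour 11, so consumers still get a value, only one produced with a longer horizon than a fresh wave would have used.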

Wrapping It All Together

The compiled tasks are saved to a queue, where they wait to be collected by a worker in the container cluster.

Once the compiled tasks are collected, they start in a pending status. The computation won’t start until all the dependent tasks update their status to finished or until a predefined timeout is reached. Once running, the task gets its input pipes from Redis, performs the computation, saves the result as a pipe to Redis, and updates its status to finished.

Example of a wave execution over a flow.
Task color represents its status: Pending, Running, and Finished.

Conclusion

Scaling machine learning models is not trivial. Using AWS ECS with Docker as the computation engine and Redis as the data storage platform, we’ve managed to design a cost-effective system that can execute tens of thousands of machine learning model instances without increasing the total execution time.

Elad Zalait

Elad leads the Data Science Engineering team at PubPlus, where he is dedicated to applying robust machine learning models to state-of-the-art big data architectures. During his free time, Elad likes to travel, dance, and play poker (occasionally winning).