Building large-scale systems that deal with a considerable amount of data often requires numerous ETL jobs and different processing mechanisms. In our case, for example, the ETL process consists of many transformations, such as normalizing, aggregating, deduplicating and enriching millions of car data records. These kinds of processes generally start as manual flows, but there comes a time when we need to automate them. To do so, we need a Workflow Management System. This blog post compares four of them: Luigi, Airflow, Pinball and Chronos.
Let’s get started.
Why Use a Workflow Management System?
Managing and scheduling complex workflows can seem easy, so you might fall into the trap of building a scheduler yourself. However, this is probably not the best idea, because you will likely run into challenges that an existing workflow management system already solves.
Take periodic tasks, for example. You may need to write logic that takes the output of one job and uses it as the input for another. The second job depends on the preceding jobs completing successfully, because they have to run in order. If the first task didn’t run properly, it could cause errors later on. Your scheduler should be able to handle those kinds of situations, without requiring you to be constantly on the lookout for bugs and errors while wading knee-deep through your code and its dependencies.
The same thing happened to us here at Otonomo. We created a Spark job on top of a managed Hadoop cluster that converts a given dataset into a standardized format and then writes the new files to an optimized, partitioned location. Because we receive large amounts of data both regularly and sporadically, we decided to run this job once an hour, ensuring we don’t miss any important data.
Then, after the conversion has taken place and the new files are written, we aggregate the data from the last 12 hours and produce summary reports of the findings. This process seems pretty simple, yet if a step fails (e.g., if an hour of data is missing), the aggregated and summarized results are inaccurate.
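To make the failure mode concrete, here is a minimal sketch of a completeness check that could run before the 12-hour aggregation. The function name `missing_hours`, the idea of representing each converted hour as a `datetime`, and the sample dates are all assumptions made for illustration; this is not our production code.

```python
from datetime import datetime, timedelta

def missing_hours(available, window_end, hours=12):
    """Return the hourly slots in the window that have no converted data."""
    # Expected slots: the `hours` full hours that end at `window_end`.
    expected = [window_end - timedelta(hours=h) for h in range(hours, 0, -1)]
    return [slot for slot in expected if slot not in available]

# Hypothetical example: the 05:00 conversion never produced output.
window_end = datetime(2018, 1, 1, 12)
available = {datetime(2018, 1, 1, h) for h in range(0, 12) if h != 5}
gaps = missing_hours(available, window_end)
```

If `gaps` is not empty, the aggregation should not run, because the summary would silently cover fewer than 12 hours.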
Therefore, we needed to ensure that the subsequent task is triggered only when our conversion task runs and succeeds. If it does not succeed, a different task is triggered instead. In other words, we needed a mechanism that supports jobs being triggered by the completion of other jobs. That’s when we decided we needed an ETL workflow framework, with a scheduler that triggers the appropriate tasks as configured.
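The triggering behavior we were after can be sketched in a few lines of plain Python. All names here (`run_with_triggers`, `convert_ok`, `convert_broken`, `aggregate`, `alert`) are hypothetical stand-ins, and a real workflow framework adds retries, persistence and scheduling on top of this idea.

```python
def run_with_triggers(task, on_success, on_failure):
    """Run `task`; trigger `on_success` with its output, or `on_failure` with the error."""
    try:
        output = task()
    except Exception as exc:
        return on_failure(exc)
    return on_success(output)

def convert_ok():
    # Stand-in for a conversion job that wrote two files.
    return ["part-00", "part-01"]

def convert_broken():
    # Stand-in for a conversion job that failed.
    raise RuntimeError("conversion failed")

def aggregate(files):
    return f"aggregated {len(files)} files"

def alert(exc):
    return f"alert: {exc}"

ok_result = run_with_triggers(convert_ok, aggregate, alert)
failed_result = run_with_triggers(convert_broken, aggregate, alert)
```

The aggregation only ever sees the output of a conversion that finished successfully, while a failure is routed to a different task.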
Comparing Workflow Management Systems
Approximately 18 months ago, we looked into four main open source projects that we thought were useful for long dependency chains:
Luigi is a fairly popular open source project created by Spotify. It has a lot of great reviews online, and the user interface for creating job flows is very easy to use. However, Luigi does not have a trigger mechanism, and as mentioned before, we needed a scheduler capable of finding and triggering newly deployed tasks. Additionally, Luigi does not assign tasks to workers, and its schedule monitoring capabilities are limited.
Airflow is an open source project developed by Airbnb. It is supported by a large community of software engineers and integrates with many different platforms and services, including AWS. The project is fairly mature, though it is still stabilizing while it is being incubated by Apache.
Pinball is an open source project built by Pinterest. It currently runs on Python 2, so it is a bit behind in terms of new capabilities (we use Python 3). We found Pinball’s user interface unfriendly and rather challenging to figure out. It also appeared to be unmaintained.
Chronos is another open source project created by Airbnb, and it runs on Mesos. Mesos is a cluster manager that distributes and manages computing resources, allowing elastic applications to be built easily. Using Chronos would require us to build and maintain a Mesos environment, which isn’t worth doing just for scheduling capabilities. If we were not a cloud native platform, we would have considered using DC/OS (by Mesosphere), and then Chronos would be a much more appealing option.
Workflow Management System Comparison Table
| | Luigi | Airflow | Pinball | Chronos |
|---|---|---|---|---|
| Major Known Contributors | Spotify | Airbnb | Pinterest | Airbnb |
| License Type | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 |
| Commit Frequency | Daily | Daily | Every Few Months | Every Few Months |
| Distributed Execution Capability | No | Yes | Yes | Yes |
Choosing the Best Workflow Management System for Us
Our main focus when doing the research was to find a framework that is maintained, has a built-in scheduler and can easily run on the AWS cloud. As seen in the table, Airflow and Luigi were both actively maintained, but because Luigi lacks a built-in scheduler, Airflow had the most to offer.
At this point in time, we chose to go with Airflow. We believe it’s the best fit for job orchestration within our business, especially since we work in a cloud-based Big Data environment.
Airflow itself uses DAGs (Directed Acyclic Graphs), which are composed of tasks with dependencies between them. DAGs can be scheduled to run periodically or triggered by the completion of another task. Airflow uses a SQL database to store the state of the DAGs, and it can scale using Celery to run tasks on remote workers. We run Airflow in Docker containers on ECS, using Celery to spread the task load across multiple containers.
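For readers new to the DAG idea, the sketch below shows dependency-ordered execution using only the Python standard library (`graphlib`, Python 3.9+). It deliberately does not use the Airflow API. The task names, the `run_dag` helper and the status strings are assumptions made for illustration, and Airflow’s own scheduler is considerably more capable.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "convert": set(),
    "aggregate": {"convert"},
    "report": {"aggregate"},
}

def run_dag(dag, actions):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    status = {}
    for name in TopologicalSorter(dag).static_order():
        # A task runs only if every task it depends on succeeded.
        if any(status[dep] != "success" for dep in dag[name]):
            status[name] = "skipped"
            continue
        try:
            actions[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status

def failing_aggregate():
    # Stand-in for an aggregation job that fails.
    raise RuntimeError("aggregation failed")

actions = {
    "convert": lambda: None,
    "aggregate": failing_aggregate,
    "report": lambda: None,
}
status = run_dag(dag, actions)
```

Here the conversion succeeds, the aggregation fails, and the report is skipped rather than being built on incomplete data.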
Which Workflow Management System do you use, and why? Tell us in the comments section.