Like many data-centric companies, we use Hadoop to routinely process large data sets on an hourly and daily basis. While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), Map/Reduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination.
Luckily, Apache ZooKeeper is offered as part of the Hadoop ecosytem. I suppose “ZooKeeper” is a play on the words used for some of the animals in the Hadoop ecosystem: The bees in the Hive and well, the pigs in Pig! More than a fancy name, it is an open source, in-memory, distributed NoSQL database, typically used for storing configuration variables. Let’s unpack what this means:
- Open source = free to use and modify and actively supported by the user community, as is the rest of the Hadoop ecosystem.
- In-memory = fast and highly available, backed by a file system for storage in case of failures.
- Distributed = it can be installed on one or more servers to maintain reliability and fault-tolerance, with automatic synchronization between the servers.
- NoSQL = it is modeled as a combination of key/values with a hierarchy like a file system, e.g. /myapp/config/var1 and /myapp/username. The read/write actions are based on simple commands.
Although many developers use ZooKeeper for managing configuration variables, here at FM we also use ZooKeeper for application coordination, maintaining state, and workflow modeling. To illustrate, let’s use a basic example that is common to many “Big Data” workflows:
Log files on servers -> Hadoop HDFS for processing and aggregation -> Relational Database -> Reporting
Each phase requires coordination to determine:
- When are the files available on the servers?
- When is the processing done and aggregated output from Hadoop ready for import into MySQL?
- When is the entire process complete, spanning hours or days of log data, in order to generate the reports?
Each “phase” may have one or more applications, written in different languages, with different strengths and weaknesses. Good software systems design practices suggest that loosely coupling each phase and perhaps each application is the best way to go. Loosely coupled services can be thought of as a factory floor where the first worker (or robot these days) makes a part and passes it along to the second worker who combines it with a different part and passes it along. The two workers don’t need to know each other’s job, they simply need to know how to hand off the products of their work.
We use ZooKeeper to implement loose coupling by having each link in the chain maintain its state through a ZooKeeper key (called a zNode) for that task sharded by date and hour:
In this way, we track progress, success, or failure of each step. ZooKeeper keeps track of the creation time and modification time of the zNode, so that timing can be tracked easily.
To pass work between applications, we have implemented queues in ZooKeeper.
To maintain the workflow, we run a monitoring application that checks the zNodes for different applications, and adds successful task items to the queue of the next application in the workflow. In the event of a failure, it can also move the task back to the queue of an application to retry the task.
I hope this provides a bit of an introduction to ZooKeeper. I recommend reading the ZooKeeper wiki for details.
In subsequent blog posts, I will provide examples of our use of ZooKeeper for maintaining state and coordinating applications using queues. I will also discuss a utility we created for using ZooKeeper from the command line and from most any software language, including Python, Perl and shell scripts.