An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.

By Reynold Xin, Co-Founder, Databricks

Project Hydrogen aims to enable first-class support for all distributed machine learning frameworks on Apache Spark™ by substantially improving the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark.

Most open source projects around machine learning and AI focus on algorithms and distributed training frameworks. Project Hydrogen is a new SPIP (Spark Project Improvement Proposal) that introduces one of the largest changes to Spark scheduling since the project's original 600 lines of code.

Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark.

Driven in part by deep learning, more and more Spark users want to integrate Spark with distributed machine learning frameworks built for state-of-the-art training. The problem is that big data frameworks like Spark and distributed deep learning frameworks don’t play well together, because big data jobs and deep learning jobs are executed in very different ways.

As an example, on Spark, each job is divided into a number of individual tasks that are independent of each other. This model is called “embarrassingly parallel,” and it is a massively scalable way of processing data that can handle petabytes of data.
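The idea of independent tasks can be sketched in a few lines of plain Python. This is not Spark's API: the function names (`split_into_partitions`, `run_task`, `run_job`) and the simulated failure are hypothetical and serve only to illustrate the execution model. The sketch assumes a job that sums the squares of a list of numbers, with one task per partition.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """One task: it reads only its own partition and talks to no other task."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, failing_partition=None):
    """Run one task per partition; retry a failed task without touching the rest."""
    partitions = split_into_partitions(data, num_partitions)
    retried = []

    def attempt(index):
        # Simulate a single transient failure on one chosen task (hypothetical).
        if index == failing_partition and index not in retried:
            retried.append(index)
            raise RuntimeError("simulated task failure")
        return run_task(partitions[index])

    def run_with_retry(index):
        try:
            return attempt(index)
        except RuntimeError:
            # Tasks are independent, so only the failed task is rerun.
            return attempt(index)

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(run_with_retry, range(num_partitions)))
    return sum(partial_results), retried


if __name__ == "__main__":
    total, retried = run_job(list(range(10)), num_partitions=4, failing_partition=2)
    print(total, retried)
```

Because no task depends on another, a failure in one partition is handled by rerunning that single task while the others finish untouched. That independence is what makes this style of processing easy to scale and to recover.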
