Batch processing is operations with large sets of static data based on reading and writes to disk and returning the result later when completing the data processing.
Hadoop with MapReduce is a typical batch operation, and therefore slower compared to the processes called “in-memory.”
Pointing out that we are talking about TeraBytes, PetaBytes, and ExaBytes data volume.
Spark is a framework that does not take the MapReduce layer of Hadoop. Its primary motivation is to carry out processing “in memory” not in disk storage as performed in batch MapReduce mode.
The performance gained is enormous because access to in-memory data is in nanoseconds while in the disk drive in milliseconds.
Spark emerged at the University of California Berkeley in 2009 as a research project to speed up machine learning algorithm’s execution on the Hadoop platform and became one core project of the Apache Foundation.
The creators of Spark founded the company Databricks that continues Spark development offering platform and services.
They developed Spark to address the limitations of MapReduce, which does not keep the data in-memory after processing for use and immediate analysis.
Spark provides layers for SparkSQL, Spark Streaming, MLib (Machine Learning) and GraphX (for graphs) which allow the use and construction of new software libraries.
The developer in Spark can use Scala, Python, Java, and R as languages.
Spark is becoming essential for companies that want to implement Big Data and also crucial in Data Scientist training.
Storm recorded and analyzed streaming data in real time. It includes a wide variety of data, such as device-generated logins, e-commerce, sensor data generation on the Internet of Things, social networking information, and geospatial services, among others.
Storm i is a Hadoop framework for data streaming that focuses on workloads that require near-real-time processing. Scalable, fault tolerant, easy to install, configure, and program. Can manage huge amounts of data and produce results in less time than other similar solutions.
Applications can use Storm to real-time data analysis, machine learning, Internet of Things, ETL, among others.
Flink is a framework for Hadoop for streaming data, which also handles batch processing.
The data streaming is finite, meaning you collect a certain amount of data, such as 500,000 tweets from Twitter, and handle them as a batch of processed data.
Flink handles batch processing as a subset of the flow processing, called the primary processing method, with flows used to complement and provide results from data analysis.
We can consider the applications the same ones that we relate to Storm, but considering flows, or finite part of the data analyzed.
Flink is free software from the Apache Foundation.
Apache Kafka e Samza
The Hadoop ecosystem grows every day searching performance.
Two more oriented tools emerged for streaming data that is Apache and Apache Kafka Samza.
Apache Kafka is a distributed platform for streaming data used to build applications using data structures captured in real-time (pipelines calls) in tens and hundreds of clusters at the same time.
Apache Samza is a framework for distributing processing of streaming data.
Samza uses Kafka for messaging, YARN for fault tolerance and resource management.
Spark can be 100 times faster than MapReduce using “in-memory” processing.
Flink seeks to work with finite data batch analysis using streams. While, Storm performs real-time, performance-focused data analysis.
Hadoop establishes the foundation for Spark to work, with distributing HDFS storage.
Hadoop uses Java language. Spark uses Scala, a programming language that contains compelling properties for data manipulation.
The choice of a data streaming framework depends on the application developed, the configurations of the servers, and the size and resources offered by the Hadoop network.
THE SPARK IMPORTANCE
Spark is part of the most successful Hadoop frameworks.
Apache Spark is a framework for distributed processing of Big Data, aiming speed and real-time analysis.
Uses Scala, Python, Java and R, and SQL libraries and streaming.
For Big Data applications in real time, Spark is the industry solution most used.
Spark allows applications on Hadoop clusters to execute 100 times faster than MapReduce as working memory and not in batches, using hard drives.
Spark is functional and access data in different formats coming from different platforms like MySQL, Amazon S3, HDFS, and Couchbase.
Spark is appealing for Big Data complex analysis, which is applications that use data of various types (structured and unstructured) involving text, database, graphs, social networks, streaming, etc. in a single application.
Many companies have invested in Hadoop to implement Data Lake (Lake of data), and once the data is in Hadoop, you can extract value from this data, and if there is the need for data analysis in real-time, so must- if you use Spark.
Basing on Kelly Sirman, Spark has grown in popularity because of its use in Machine Learning algorithms.
Apache Spark was developed in the APM Lab at UC Berkeley in 2009. Developers were working with Hadoop MapReduce, perceived inefficiencies in interactive computing execution, and created Spark.
Led by Matei Zaharia, lead developer of Spark, and founded Databricks to explore products and services of Big Data-based Spark. Databricks is the leading Spark developer working in partnership with the Apache Foundation.
Netflix, Yahoo, eBay are extensive users of Spark to process petabytes daily data in Hadoop installations containing over 10,000 nodes.
The largest open source developer community for Big Data today is Spark’s.Big Data Hadoop Technology?”
According to a recent report by IBM Marketing cloud, “90 percent of the data in the world today has been created in the last two years alone, creating 2.5 quintillion bytes of data every day — and with new devices, sensors and technologies emerging, the data growth rate will likely accelerate even more”. Technically this means our Big Data Processing world is going to be more complex and more challenging. And a lot of use cases (e.g. mobile app ads, fraud detection, cab booking, patient monitoring,etc) need data processing in real-time, as and when data arrives, to make quick actionable decisions. This is why Distributed Stream Processing has become very popular in Big Data world.
Today there are a number of open source streaming frameworks available. Interestingly, almost all of them are quite new and have been developed in last few years only. So it is quite easy for a new person to get confused in understanding and differentiating among streaming frameworks. In this post I will first talk about types and aspects of Stream Processing in general and then compare the most popular open source Streaming frameworks : Flink, Spark Streaming, Storm, Kafka Streams. I will try to explain how they work (briefly), their use cases, strengths, limitations, similarities and differences.
What is Streaming/Stream Processing : The most elegant definition I found is : a type of data processing engine that is designed with infinite data sets in mind. Nothing more.
Unlike Batch processing where data is bounded with a start and an end in a job and the job finishes after processing that finite data, Streaming is meant for processing unbounded data coming in realtime continuously for days,months,years and forever. As such, being always meant for up and running, a streaming application is hard to implement and harder to maintain.
Important Aspects of Stream Processing:
There are some important characteristics and terms associated with Stream processing which we should be aware of in order to understand strengths and limitations of any Streaming framework :
Delivery Guarantees : It means what is the guarantee that no matter what, a particular incoming record in a streaming engine will be processed. It can be either Atleast-once (will be processed atleast one time even in case of failures) , Atmost-once (may not be processed in case of failures) or Exactly-once (will be processed one and exactly one time even in case of failures) . Obviously Exactly-once is desirable but is hard to achieve in distributed systems and comes in tradeoffs with performance.
Fault Tolerance : In case of failures like node failures,network failures,etc, framework should be able to recover and should start processing again from the point where it left. This is achieved through checkpointing the state of streaming to some persistent storage from time to time. e.g. checkpointing kafka offsets to zookeeper after getting record from Kafka and processing it.
State Management : In case of stateful processing requirements where we need to maintain some state (e.g. counts of each distinct word seen in records), framework should be able to provide some mechanism to preserve and update state information.
Performance : This includes latency(how soon a record can be processed), throughput (records processed/second) and scalability. Latency should be as minimum as possible while throughput should be as much as possible. It is difficult to get both at same time.
Advanced Features : Event Time Processing, Watermarks, Windowing These are features needed if stream processing requirements are complex. For example, processing records based on time when it was generated at source (event time processing). To know more in detail, please read these must-read posts by Google guy Tyler Akidau : part1 and part2.
Maturity : Important from adoption point of view, it is nice if the framework is already proven and battle tested at scale by big companies. More likely to get good community support and help on stackoverflow.
Two Types of Stream Processing:
Now being aware of the terms we just discussed, it is now easy to understand that there are 2 approaches to implement a Streaming framework:
Native Streaming : Also known as Native Streaming. It means every incoming record is processed as soon as it arrives, without waiting for others. There are some continuous running processes (which we call as operators/tasks/bolts depending upon the framework) which run for ever and every record passes through these processes to get processed. Examples : Storm, Flink, Kafka Streams, Samza.
Micro-batching : Also known as Fast Batching. It means incoming records in every few seconds are batched together and then processed in a single mini batch with delay of few seconds. Examples: Spark Streaming, Storm-Trident.
Both approaches have some advantages and disadvantages. Native Streaming feels natural as every record is processed as soon as it arrives, allowing the framework to achieve the minimum latency possible. But it also means that it is hard to achieve fault tolerance without compromising on throughput as for each record, we need to track and checkpoint once processed. Also, state management is easy as there are long running processes which can maintain the required state easily.
Micro-batching , on the other hand, is quite opposite. Fault tolerance comes for free as it is essentially a batch and throughput is also high as processing and checkpointing will be done in one shot for group of records. But it will be at some cost of latency and it will not feel like a natural streaming. Also efficient state management will be a challenge to maintain.
Streaming Frameworks One By One:
Storm : Storm is the hadoop of Streaming world. It is the oldest open source streaming framework and one of the most mature and reliable one. It is true streaming and is good for simple event based use cases. I have shared details about Storm at length in these posts: part1 and part2.
Very low latency,true streaming, mature and high throughput
Excellent for non-complicated streaming use cases
No state management
No advanced features like Event time processing, aggregation, windowing, sessions, watermarks, etc
Spark Streaming :
Spark has emerged as true successor of hadoop in Batch processing and the first framework to fully support the Lambda Architecture (where both Batch and Streaming are implemented; Batch for correctness, Streaming for Speed). It is immensely popular, matured and widely adopted. Spark Streaming comes for free with Spark and it uses micro batching for streaming. Before 2.0 release, Spark Streaming had some serious performance limitations but with new release 2.0+ , it is called structured streaming and is equipped with many good features like custom memory management (like flink) called tungsten, watermarks, event time processing support,etc. Also Structured Streaming is much more abstract and there is option to switch between micro-batching and continuous streaming mode in 2.3.0 release. Continuous Streaming mode promises to give sub latency like Storm and Flink, but it is still in infancy stage with many limitations in operations.
Supports Lambda architecture, comes free with Spark
High throughput, good for many use cases where sub-latency is not required
Fault tolerance by default due to micro-batch nature
Simple to use higher level APIs
Big community and aggressive improvements
Not true streaming, not suitable for low latency requirements
Flink is also from similar academic background like Spark. While Spark came from UC Berkley, Flink came from Berlin TU University. Like Spark it also supports Lambda architecture. But the implementation is quite opposite to that of Spark. While Spark is essentially a batch with Spark streaming as micro-batching and special case of Spark Batch, Flink is essentially a true streaming engine treating batch as special case of streaming with bounded data. Though APIs in both frameworks are similar, but they don’t have any similarity in implementations. In Flink, each function like map,filter,reduce,etc is implemented as long running operator (similar to Bolt in Storm)
Flink looks like a true successor to Storm like Spark succeeded hadoop in batch.
Leader of innovation in open source Streaming landscape
First True streaming framework with all advanced features like event time processing, watermarks, etc
Low latency with high throughput, configurable according to requirements
Auto-adjusting, not too many parameters to tune
Getting widely accepted by big companies at scale like Uber,Alibaba.
Little late in game, there was lack of adoption initially
Community is not as big as Spark but growing at fast pace now
No known adoption of the Flink Batch as of now, only popular for streaming.
Kafka Streams :
Kafka Streams , unlike other streaming frameworks, is a light weight library. It is useful for streaming data from Kafka , doing transformation and then sending back to kafka. We can understand it as a library similar to Java Executor Service Thread pool, but with inbuilt support for Kafka. It can be integrated well with any application and will work out of the box.
Due to its light weight nature, can be used in microservices type architecture. There is no match in terms of performance with Flink but also does not need separate cluster to run, is very handy and easy to deploy and start working . Internally uses Kafka Consumer group and works on the Kafka log philosophy. This post thoroughly explains the use cases of Kafka Streams vs Flink Streaming.
One major advantage of Kafka Streams is that its processing is Exactly Once end to end. It is possible because the source as well as destination, both are Kafka and from Kafka 0.11 version released around june 2017, Exactly once is supported. For enabling this feature, we just need to enable a flag and it will work out of the box. For more details shared here and here.
Very light weight library, good for microservices,IOT applications
Does not need dedicated cluster
Inherits all Kafka good characteristics
Supports Stream joins, internally uses rocksDb for maintaining state.
Exactly Once ( Kafka 0.11 onwards).
Tightly coupled with Kafka, can not use without Kafka in picture
Quite new in infancy stage, yet to be tested in big companies
Not for heavy lifting work like Spark Streaming,Flink.
Will cover Samza in short. Samza from 100 feet looks like similar to Kafka Streams in approach. There are many similarities. Both of these frameworks have been developed from same developers who implemented Samza at LinkedIn and then founded Confluent where they wrote Kafka Streams. Both these technologies are tightly coupled with Kafka, take raw data from Kafka and then put back processed data back to Kafka. Use the same Kafka Log philosophy. Samza is kind of scaled version of Kafka Streams. While Kafka Streams is a library intended for microservices , Samza is full fledge cluster processing which runs on Yarn. Advantages :
Very good in maintaining large states of information (good for use case of joining streams) using rocksDb and kafka log.
Fault Tolerant and High performant using Kafka properties
One of the options to consider if already using Yarn and Kafka in the processing pipeline.
Good Yarn citizen
Low latency , High throughput , mature and tested at scale
Tightly coupled with Kafka and Yarn. Not easy to use if either of these not in your processing pipeline.
Atleast-Once processing guarantee. I am not sure if it supports exactly once now like Kafka Streams after Kafka 0.11
Lack of advanced streaming features like Watermarks, Sessions, triggers, etc
Comparison of Streaming Frameworks:
We can compare technologies only with similar offerings. While Storm, Kafka Streams and Samza look now useful for simpler use cases, the real competition is clear between the heavyweights with latest features: Spark vs Flink
When we talk about comparison, we generally tend to ask: Show me the numbers 🙂
Benchmarking is a good way to compare only when it has been done by third parties.
For example one of the old bench marking was this. But this was at times before Spark Streaming 2.0 when it had limitations with RDDs and project tungsten was not in place. Now with Structured Streaming post 2.0 release , Spark Streaming is trying to catch up a lot and it seems like there is going to be tough fight ahead.
Recently benchmarking has kind of become open cat fight between Spark and Flink.
It is better not to believe benchmarking these days because even a small tweaking can completely change the numbers. Nothing is better than trying and testing ourselves before deciding. As of today, it is quite obvious Flink is leading the Streaming Analytics space, with most of the desired aspects like exactly once, throughput, latency, state management, fault tolerance, advance features, etc.
These have been possible because of some of the true innovations of Flink like light weighted snapshots and off heap custom memory management. One important concern with Flink was maturity and adoption level till sometime back but now companies like Uber,Alibaba,CapitalOne are using Flink streaming at massive scale certifying the potential of Flink Streaming.
One important point to note, if you have already noticed, is that all native streaming frameworks like Flink, Kafka Streams, Samza which support state management uses RocksDb internally. RocksDb is unique in sense it maintains persistent state locally on each node and is highly performant. It has become crucial part of new streaming systems. I have shared detailed info on RocksDb in one of the previous posts.
How to Choose the Best Streaming Framework :
This is the most important part. And the honest answer is: it depends 🙂 It is important to keep in mind that no single processing framework can be silver bullet for every use case. Every framework has some strengths and some limitations too. Still , with some experience, will share few pointers to help in taking decisions:
Depends on the use cases: If the use case is simple, there is no need to go for the latest and greatest framework if it is complicated to learn and implement. A lot depends on how much we are willing to invest for how much we want in return. For example, if it is simple IOT kind of event based alerting system, Storm or Kafka Streams is perfectly fine to work with.
Future Considerations: At the same time, we also need to have a conscious consideration over what will be the possible future use cases? Is it possible that demands of advanced features like event time processing,aggregation, stream joins,etc can come in future ? If answer is yes or may be, then its is better to go ahead with advanced streaming frameworks like Spark Streaming or Flink. Once invested and implemented in one technology, its difficult and huge cost to change later. For example, In previous company we were having a Storm pipeline up and running from last 2 years and it was working perfectly fine until a requirement came for uniquifying incoming events and only report unique events. Now this demanded state management which is not inherently supported by Storm. Although I implemented using time based in-memory hashmap but it was with limitation that the state will go away on restart . Also, it gave issues during such changes which I have shared in one of the previous posts. The point I am trying to make is, if we try to implement something on our own which the framework does not explicitly provide, we are bound to hit unknown issues.
Existing Tech Stack : One more important point is to consider the existing tech stack. If the existing stack has Kafka in place end to end, then Kafka Streams or Samza might be easier fit. Similarly, if the processing pipeline is based on Lambda architecture and Spark Batch or Flink Batch is already in place then it makes sense to consider Spark Streaming or Flink Streaming. For example, in my one of previous projects I already had Spark Batch in pipeline and so when the streaming requirement came, it was quite easy to pick Spark Streaming which required almost the same skill set and code base.
In short, If we understand strengths and limitations of the frameworks along with our use cases well, then it is easier to pick or atleast filtering down the available options. Lastly it is always good to have POCs once couple of options have been selected. Everyone has different taste bud after all.
Apache Streaming space is evolving at so fast pace that this post might be outdated in terms of information in couple of years. Currently Spark and Flink are the heavyweights leading from the front in terms of developments but some new kid can still come and join the race. Apache Apex is one of them. Also there are proprietary streaming solutions as well which I did not cover like Google Dataflow. My objective of this post was to help someone who is new to streaming to understand, with minimum jargons, some core concepts of Streaming along with strengths, limitations and use cases of popular open source streaming frameworks. Hope the post was helpful in someway.