Spark Tutorial: Spark Streaming with Kafka and MLlib
In this part of the Spark tutorial (part 3), we introduce two important components of the Spark ecosystem: Spark Streaming and MLlib.
Spark Streaming
By Fadi Maalouli and R.H.
Spark Streaming is a real-time processing tool that offers a high-level API, is fault tolerant, and integrates easily with SQL DataFrames and GraphX.
At a high level, Spark Streaming runs receivers that ingest data from sources such as S3, Cassandra, and Kafka. The receivers divide the incoming data into blocks and push those blocks into Spark, which then processes them as RDDs to produce your results. The following diagram illustrates the process:
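As a rough companion to the diagram, the same flow can be sketched as a toy model in plain Python. This is not Spark's API: the names `receive_in_blocks` and `process_block` and the sample input are hypothetical, chosen only for illustration. The sketch also assumes count-based blocks for simplicity, whereas Spark Streaming forms its blocks and batches by time interval.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def receive_in_blocks(records: Iterable[str], block_size: int) -> Iterator[List[str]]:
    """Toy 'receiver': group an incoming stream of records into blocks.

    Assumption: blocks are cut by record count here; Spark cuts them by time.
    """
    block: List[str] = []
    for record in records:
        block.append(record)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # flush the final, partially filled block
        yield block


def process_block(block: List[str]) -> Counter:
    """Toy 'batch job': count words in one block, standing in for work on one RDD."""
    counts: Counter = Counter()
    for line in block:
        counts.update(line.split())
    return counts


# Hypothetical input standing in for records arriving from a source such as Kafka.
incoming = ["a b", "b c", "c c", "a"]

# Each block is processed independently, one small batch at a time.
results = [process_block(block) for block in receive_in_blocks(incoming, block_size=2)]
for i, counts in enumerate(results):
    print(f"batch {i}: {dict(counts)}")
```

The point of the sketch is the shape of the pipeline: a continuous input is cut into small pieces, and each piece is handled with ordinary batch logic. In Spark Streaming that batch logic runs on RDDs across a cluster rather than on Python lists.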