Apache Spark Streaming at it’s Best
Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads.
Spark Streaming is different from other systems that either have a processing engine designed only for streaming, or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major aspects are:
- Fast recovery from failures and stragglers
- Better load balancing and resource usage
- Combining of streaming data with static dataset and interactive queries
- Native integration with advanced processing libraries (SQL, machine learning, graph processing)
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs
Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset
Any operation applied on a DStream translates to operations on the underlying RDDs
Spark Streaming provides two categories of built-in streaming sources.
- Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
- Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes
val conf = new SparkConf().setMaster(“local”).setAppName(“NetworkWordCount”)
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream(“localhost”, 9999)
val words = lines.flatMap(_.split(“ “))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
Submit this spark job . And run nc -lk 9999 on another terminal and input data in it which will display total word count in that timeframe
Spark Streaming Operations
i. Transformation Operations in Spark
Similar to Spark RDDs, Spark transformations allow modification of the data from the input DStream. DStreams support many transformations that are available on normal Spark RDD’s. Some of the common ones are as follows.
map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), Window()
ii. Output Operations in Apache Spark
DStream’s data push out to external systems like a database or file systems using Output Operations. Since external systems consume the transformed data as allowed by the output operations, they trigger the actual execution of all the DStream transformations. Currently, the following output operations define as:
print(), saveAsTextFiles(prefix, [suffix])”prefix-TIME_IN_MS[.suffix]”, saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), foreachRDD(func)
Hence, DStreams like RDDs execute lazily by the output operations