Getting Started With Spark Basics

Shubham Dangare
4 min readNov 27, 2019

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Here are the Spark core components.

All the functionality provided by Apache Spark is built on top of Spark Core. Its most important feature is that it overcomes the main shortcoming of MapReduce by using in-memory computation.

RDD

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is the fundamental data structure of Apache Spark. An RDD is an immutable collection of objects that is computed on different nodes of the cluster.

The second abstraction in Spark is shared variables, which can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. The two types of shared variables Spark supports are broadcast variables and accumulators.
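As a minimal sketch (assuming the SparkContext sc from the Spark shell), a broadcast variable ships a read-only value to every executor once, while an accumulator lets tasks add to a counter that only the driver reads:

// Broadcast variable: read-only lookup table shipped once to each executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks can only add to it; the driver reads the result
val missing = sc.longAccumulator("missingKeys")

sc.parallelize(Seq("a", "b", "x")).foreach { key =>
  if (!lookup.value.contains(key)) missing.add(1)
}

println(missing.value) // 1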

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program, so that the collection can be operated on in parallel.
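For example, in the Spark shell (where sc is the pre-built SparkContext):

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// The resulting RDD can be operated on in parallel
println(distData.reduce(_ + _)) // 15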

External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, and HBase.
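For example, text files can be read with SparkContext’s textFile method, which returns an RDD of lines (the path below is just an illustration):

val lines = sc.textFile("data/sample.txt")

// textFile also accepts directories, wildcards and compressed files
println(lines.count())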

RDDs support two major types of operations: transformations and actions.

1) Transformations, which create a new dataset from an existing one.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. All transformations in Spark are lazy, in that they do not compute their results right away. A few other transformations are filter, flatMap, distinct, union, intersection, groupByKey, reduceByKey, and join.

2) Actions, which return a value to the driver program after running a computation on the dataset.

For example, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. A few other actions are collect, count, first, take, saveAsTextFile, and foreach. A short example of both kinds of operations is sketched below.
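A small sketch combining transformations and actions (again assuming the Spark shell’s sc):

val lines = sc.parallelize(Seq("spark is fast", "spark is general purpose"))

// Transformations are lazy: nothing is computed yet
val words = lines.flatMap(_.split(" "))
val sparkWords = words.filter(_ == "spark")

// Actions trigger the actual computation
println(sparkWords.count()) // 2
println(words.distinct().collect().mkString(", "))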

Spark SQL, DataFrames and Datasets

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Spark SQL uses this extra information to perform additional optimizations.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
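A minimal sketch of building and querying a DataFrame (the names and values are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A DataFrame with the named columns "name" and "age"
val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

people.printSchema()
people.filter($"age" > 30).show()

// The same data can also be queried with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()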

Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
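A short sketch of a Dataset backed by a case-class Encoder (reusing the spark session and implicits from the DataFrame example above):

case class Person(name: String, age: Long)

val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Typed operations: the compiler knows each element is a Person
ds.filter(_.age > 30).map(_.name).show()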

Spark SQL supports two different methods for converting existing RDDs into Datasets:

1) Inferring the Schema Using Reflection

2) Programmatically Specifying the Schema
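Both approaches are sketched below (the input path and field names are assumptions):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// 1) Inferring the schema using reflection: map the RDD to case classes
case class Person(name: String, age: Long)
val peopleDF = sc.textFile("data/people.txt") // lines like "Alice,29"
  .map(_.split(","))
  .map(attrs => Person(attrs(0), attrs(1).trim.toLong))
  .toDF()

// 2) Programmatically specifying the schema with StructType
val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val rowRDD = sc.parallelize(Seq("Alice", "Bob")).map(Row(_))
val namesDF = spark.createDataFrame(rowRDD, schema)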

How to Submit a Job

1) Create an sbt project with the following dependency.
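A minimal build.sbt might look like this (the Spark and Scala versions are assumptions; match them to your cluster):

name := "simple-project"

version := "1.0"

scalaVersion := "2.12.15"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"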

2) Create the file SimpleApp.scala.
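For example, a minimal SimpleApp.scala in the spirit of the official quick start (the file it reads is just an arbitrary local text file):

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md" // any text file will do
    val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()

    println(s"Lines with a: $numAs, lines with b: $numBs")
    spark.stop()
  }
}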

3) Package the application, which will create a jar in the target/ folder:

sbt package

4) Submit the job with spark-submit:

YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar
