Getting Started With Spark Basics
Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Here are the Spark core components.
All the functionality provided by Apache Spark is built on top of Spark Core. Its most important feature is in-memory computation, which overcomes the main snag of MapReduce: writing intermediate results to disk between stages.
RDD
The main abstraction Spark provides is a resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster.
The second abstraction in Spark is shared variables, which can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. The two types of shared variables Spark supports are broadcast variables and accumulators.
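A minimal sketch of both shared-variable types, assuming a live SparkContext named sc (the lookup map and key names are illustrative):

```scala
// Broadcast variable: read-only data shipped once per executor
// rather than once per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks only add to it; the driver reads the result.
val badRecords = sc.longAccumulator("badRecords")

sc.parallelize(Seq("a", "b", "x")).foreach { key =>
  if (!lookup.value.contains(key)) badRecords.add(1)
}
// Back on the driver, badRecords.value reflects how many keys
// were missing from the broadcast map.
```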
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program; the resulting distributed dataset can then be operated on in parallel.
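A short sketch, again assuming a SparkContext named sc:

```scala
val data = Array(1, 2, 3, 4, 5)
// distData is an RDD partitioned across the cluster
val distData = sc.parallelize(data)
// Once created, it can be operated on in parallel:
val sum = distData.reduce(_ + _) // 15
```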
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, and HBase.
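For example, text files can be read with SparkContext's textFile method; the path "data.txt" below is hypothetical and could equally be an HDFS URI:

```scala
// Each element of the RDD is one line of the file
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(_.length)
val totalLength = lineLengths.reduce(_ + _)
```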
RDDs support two major kinds of operations: transformations and actions.
1) Transformations, which create a new dataset from an existing one.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. All transformations in Spark are lazy, in that they do not compute their results right away. A few other transformations are filter, flatMap, distinct, union, intersection, groupByKey, reduceByKey, and join.
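The laziness of transformations can be sketched as follows (assuming a SparkContext named sc):

```scala
val nums = sc.parallelize(1 to 10)
// Transformations only record the lineage; nothing runs yet.
val evens = nums.filter(_ % 2 == 0) // lazy
val doubled = evens.map(_ * 2)      // still lazy
// Only the action below triggers actual computation:
doubled.collect() // Array(4, 8, 12, 16, 20)
```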
2) Actions, which return a value to the driver program after running a computation on the dataset. For example, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. A few other actions are collect, count, first, take, saveAsTextFile, and foreach.
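A few of these actions in use, assuming a SparkContext named sc:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark"))
words.count()            // 3 -- number of elements
words.first()            // "spark" -- first element
words.take(2)            // Array("spark", "rdd")
words.distinct().count() // 2 -- unique elements
```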
Spark SQL, DataFrames and Datasets
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Spark SQL uses this extra information to perform additional optimizations.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
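A small sketch, assuming a live SparkSession named spark (the column names and rows are illustrative):

```scala
import spark.implicits._

// Named columns make the data queryable like a table
val people = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
people.filter($"age" > 30).show()

// The same query expressed in SQL over a temp view:
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```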
DataSet
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network.
While both encoders and standard serialization are responsible for turning an object into bytes, encoders are generated dynamically and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.
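A minimal Dataset sketch, assuming a SparkSession named spark; the Person case class is illustrative:

```scala
case class Person(name: String, age: Long)
import spark.implicits._ // provides Encoders for case classes

// The Encoder[Person] lets Spark filter and sort on the
// serialized binary format without materializing each object.
val ds = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
ds.filter(_.age > 30).show()
```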
Spark SQL supports two different methods for converting existing RDDs into Datasets:
1) Inferring the schema using reflection
2) Programmatically specifying the schema
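Both methods can be sketched as follows, assuming a SparkSession named spark and a SparkContext named sc; the Record class and sample rows are illustrative:

```scala
// 1) Reflection: the case class fields define the schema
case class Record(key: Int, value: String)
import spark.implicits._
val byReflection = sc.parallelize(Seq(Record(1, "a"))).toDF()

// 2) Programmatic: build a StructType and attach it to an RDD[Row]
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row(1, "a")))
val bySchema = spark.createDataFrame(rowRDD, schema)
```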
How to Submit a Job
1) Create an sbt project with the Spark dependency declared in build.sbt.
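The original does not show the dependency line itself; a typical build.sbt might look like the following sketch, where the project name and version numbers are assumptions you should adjust to your setup:

```scala
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.15" // assumption; must match the Spark build

// spark-sql pulls in spark-core; "provided" because spark-submit
// supplies the Spark jars at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"
```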
2) Create a file SimpleApp.scala.
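A minimal SimpleApp.scala sketch, patterned after the Spark quick start; the README.md path is an assumption, so point it at any text file you have:

```scala
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Path is an assumption -- replace with a real file
    val logData = spark.read.textFile("YOUR_SPARK_HOME/README.md").cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
```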
3) Run sbt package, which will create a jar in the target/ folder:
sbt package
4) Submit the application:
YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar