Apache Spark: a cluster-computing framework

Sahani Rajapakshe
May 20, 2020


Most of you would have heard about Apache Spark.

What is Apache Spark?

Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.

Spark started in 2009 as a research project at UC Berkeley’s AMPLab, a collaboration involving students, researchers and faculty, focused on data-intensive application domains.

Spark provides a faster and more general data processing platform.

It is used for big-data workloads, relying on in-memory caching and optimized query execution to run fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning and graph processing.
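To give a flavour of the API, here is a minimal word-count job in Scala; the application name and input path are placeholders, not anything from the article:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create (or reuse) a SparkSession, the entry point since Spark 2.x
    val spark = SparkSession.builder()
      .appName("WordCount")
      .getOrCreate()

    // Read a text file, split lines into words and count each word
    val counts = spark.read.textFile("hdfs:///data/input.txt") // placeholder path
      .rdd
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```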

Spark was built on top of the Hadoop MapReduce model. It is optimized to run in memory, whereas approaches like Hadoop’s MapReduce write intermediate data to and from disk, so Spark processes data much more quickly than those alternatives.

Spark is designed to cover a wide variety of workloads, for example batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all of these workloads in a single framework, it reduces the management burden of maintaining separate tools.

Components of Apache Spark

Spark ships as a stack of tightly integrated components on top of Spark Core: Spark SQL for structured data, Spark Streaming for real-time data, MLlib for machine learning and GraphX for graph processing.

Uses of Apache Spark

Data Integration

The data generated by different systems is rarely consistent enough to be combined for analysis. To get consistent data out of those systems, we use processes such as Extract, Transform and Load (ETL). Spark is used to reduce the cost and time required for this ETL work.
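As a rough sketch of what such an ETL job can look like with Spark’s DataFrame API (the bucket paths, column names and formats are invented for illustration):

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object SimpleEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SimpleEtl").getOrCreate()

    // Extract: read raw CSV records (path and columns are hypothetical)
    val raw = spark.read
      .option("header", "true")
      .csv("s3a://raw-bucket/orders/*.csv")

    // Transform: drop malformed rows, normalise a column, derive a new one
    val cleaned = raw
      .na.drop(Seq("order_id", "amount"))
      .withColumn("amount", F.col("amount").cast("double"))
      .withColumn("order_date", F.to_date(F.col("order_ts")))

    // Load: write the consistent data out as partitioned Parquet
    cleaned.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("s3a://clean-bucket/orders/")

    spark.stop()
  }
}
```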

Machine Learning

Apache Spark is equipped with a scalable machine learning library called MLlib that can perform advanced analytics such as clustering, classification and regression. This makes Spark suitable for some very common big data tasks, like predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis.

Once the data arrives in storage, it can undergo further analysis via other components of the stack, such as MLlib.
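As a small, hypothetical sketch of the kind of MLlib job this enables, here is k-means customer segmentation in Scala; the table path, column names and number of clusters are all assumptions made for the example:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object CustomerSegments {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CustomerSegments").getOrCreate()

    // Hypothetical customer metrics table with numeric columns
    val customers = spark.read.parquet("hdfs:///warehouse/customers/")

    // Assemble the numeric columns into a single feature vector
    val features = new VectorAssembler()
      .setInputCols(Array("age", "annual_spend", "visits_per_month"))
      .setOutputCol("features")
      .transform(customers)

    // Cluster customers into 5 segments with k-means
    val model = new KMeans().setK(5).setSeed(42L).fit(features)
    model.transform(features)
      .groupBy("prediction").count()
      .show()

    spark.stop()
  }
}
```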

Fog Computing

While big data analytics gets a lot of attention, the concept that really sparks the tech community’s imagination is the Internet of Things (IoT). With the influx of big data concepts, IoT has acquired a prominent role in driving more advanced technologies. Built on the idea of connecting digital devices through small sensors, it produces huge amounts of data from numerous sources. Processing that data requires massively parallel, low-latency computation, which is hard to achieve with centralized cloud computing alone; this is where fog computing, which pushes processing out towards the devices, comes in.

The IoT embeds objects and devices with tiny sensors that communicate with each other and with the user, creating a fully interconnected world.

Interactive Analysis

Among the most notable features of Apache Spark is its support for interactive analysis. Unlike MapReduce, which is built for batch processing, Spark is fast enough to run exploratory queries over the full dataset without sampling. By combining Spark with visualization tools, complex data sets can be processed and explored interactively.
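For example, an exploratory query typed into spark-shell might look like the following; the dataset, table and column names are invented for illustration:

```scala
// Typed into spark-shell, where `spark` (a SparkSession) is already defined
val events = spark.read.parquet("hdfs:///logs/events/") // hypothetical dataset
events.createOrReplaceTempView("events")

// Explore the full dataset interactively, no sampling required
spark.sql("""
  SELECT country, count(*) AS views
  FROM events
  WHERE event_type = 'page_view'
  GROUP BY country
  ORDER BY views DESC
  LIMIT 20
""").show()
```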

Spark in the Real World

Companies using Apache Spark in production include:

Uber — Every day this multinational online taxi dispatch company gathers terabytes of event data from its mobile users. Uber uses Kafka, Spark Streaming, and HDFS for building a continuous ETL pipeline.

Pinterest — Pinterest uses Spark Streaming to gain deep insight into customer engagement details.

Conviva — Conviva uses Spark to reduce customer churn by optimizing video streams and managing live video traffic.

Architecture Model

Apache Spark has a well-defined layered architecture where all the Spark components and layers are loosely coupled.

Apache Spark Architecture is based on two main abstractions (sketched in a short example after this list):

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)
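A rough illustration of both abstractions in spark-shell: transformations on an RDD are lazy, so Spark only builds up the DAG of operations and runs it when an action is called, and toDebugString prints the resulting lineage. The numbers and partition count below are arbitrary.

```scala
// Inside spark-shell; `sc` is the SparkContext
val numbers = sc.parallelize(1 to 1000000, numSlices = 8) // an RDD with 8 partitions

// Transformations only extend the DAG; nothing executes yet
val evens   = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n.toLong * n)

println(squared.toDebugString) // shows the lineage (the DAG behind this RDD)

// The action below triggers the whole DAG to run on the cluster
println(squared.sum())
```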

On the master node runs the driver program, which drives your application. The code you write behaves as the driver program, or, if you are using the interactive shell, the shell itself acts as the driver program.

Inside the driver program, the first thing you do is create a SparkContext. Think of the SparkContext as a gateway to all Spark functionality. It is similar to a database connection: any command you execute against your database goes through that connection.
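A minimal sketch of that first step; the application name, master URL and input path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver program starts by creating a SparkContext,
// the "gateway" through which every Spark operation goes
val conf = new SparkConf()
  .setAppName("MyDriverApp")
  .setMaster("spark://master-host:7077") // cluster manager URL; a placeholder here

val sc = new SparkContext(conf)

// From here on, RDDs, jobs and tasks are all created via `sc`
val data = sc.textFile("hdfs:///data/sample.txt")
println(data.count())

sc.stop()
```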

The SparkContext works with the cluster manager to manage the various jobs. The driver program and SparkContext take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the SparkContext, it can be distributed across the various nodes and cached there.

Worker nodes are the slave nodes whose job is to execute the tasks. The tasks run on the partitions of the RDD held by each worker node, and the results are returned to the SparkContext.

In short, the SparkContext takes a job, breaks it into tasks and distributes them to the worker nodes. These tasks work on the partitioned RDD, perform their operations, and return the results to the SparkContext.
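The number of partitions of an RDD determines how many tasks each stage is broken into. A quick way to inspect and change it in spark-shell (the file path and partition count are arbitrary):

```scala
// Inside spark-shell
val logs = sc.textFile("hdfs:///logs/access.log") // hypothetical file

println(logs.getNumPartitions)   // one task per partition per stage

// Repartition to spread the work over more worker cores
val wider = logs.repartition(64)
println(wider.getNumPartitions)  // 64 partitions -> 64 tasks per stage
```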

Apache Spark provides the following properties:

  1. Fault tolerance

Fault tolerance in Apache Spark is the capability to keep operating and to recover losses after a failure occurs. For a system to be fault tolerant it must be redundant, because a redundant component is what allows the lost data to be recovered: faulty data is rebuilt from redundant data.

If any of the nodes processing data crashes, that is a fault in the cluster. An RDD is logically partitioned, and at any point in time each node is operating on one partition. The operations being performed are a series of (for example, Scala) functions executed on that partition of the RDD. This series of operations is merged together into a DAG, a Directed Acyclic Graph, which keeps track of the operations that have been performed.

If a node crashes in the middle of an operation, the cluster manager detects it and assigns another node to continue the processing from the same place. The new node operates on the same partition of the RDD and the same series of operations, so there is effectively no data loss and processing continues correctly.

The basic fault-tolerance properties of Spark are:

· RDDs are immutable: each Spark RDD remembers the lineage of deterministic operations that were applied to a fault-tolerant input dataset to create it.

· If any partition of an RDD is lost due to a worker-node failure, it can be recomputed from the original fault-tolerant dataset using that lineage.

· Because all RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster (a short lineage sketch follows this list).
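A small illustration of such a lineage: every RDD below is derived deterministically from a replicated HDFS file, so any lost partition can simply be recomputed from the source. The file path is a placeholder.

```scala
// A lineage chain derived from a fault-tolerant source (a replicated HDFS file)
val lines  = sc.textFile("hdfs:///logs/app.log")    // source partitions live in HDFS
val errors = lines.filter(_.contains("ERROR"))      // deterministic transformation
val upper  = errors.map(_.toUpperCase)              // another deterministic step

// If a worker dies and a partition of `upper` is lost, Spark replays just
// textFile -> filter -> map for that partition; no replication of `upper` is needed.
upper.count()
```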

In a streaming job, there are two types of received data that need to be recovered in the event of a failure:

· Data received and replicated — the data has already been replicated to another node, so it can be retrieved from that replica after a failure.

· Data received but buffered for replication — the data has not been replicated yet, so the only way to recover it is to retrieve it again from the source (see the sketch below).
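One standard way to protect this second category in receiver-based Spark Streaming is to enable the write-ahead log, so received data is also saved to fault-tolerant storage before it is processed. A minimal sketch, with placeholder host, port and paths:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the write-ahead log so that received-but-not-yet-replicated
// data can also survive a receiver/driver failure
val conf = new SparkConf()
  .setAppName("ReliableReceiverApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/receiver-app") // WAL and metadata are stored here

val lines = ssc.socketTextStream("stream-host", 9999) // placeholder source
lines.count().print()

ssc.start()
ssc.awaitTermination()
```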

2. Highly Available

Apache Spark is considered highly available if its downtime stays within what can be tolerated; how much that is depends on how critical the application is, and zero downtime is unrealistic for any system. Consider a machine with an uptime of 97.7%, so its probability of being down is 0.023. In most high-availability environments three machines are in use, in which case the probability of all of them being down at once is 0.023 × 0.023 × 0.023 ≈ 0.000012167, which corresponds to an uptime of about 99.9987833%, a highly acceptable guarantee.
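Spelled out as a tiny snippet, the arithmetic looks like this:

```scala
// Availability arithmetic from the paragraph above
val singleNodeUptime   = 0.977                  // 97.7 % uptime
val singleNodeDowntime = 1 - singleNodeUptime   // 0.023

// With three independent machines, all must be down at once for an outage
val clusterDowntime = math.pow(singleNodeDowntime, 3) // ≈ 0.000012167
val clusterUptime   = 1 - clusterDowntime              // ≈ 0.999987833

println(f"cluster uptime ≈ ${clusterUptime * 100}%.7f %%") // ≈ 99.9987833 %
```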

3. Recoverable

Self-recovery is one of the most powerful features of the Spark platform: at any stage of a failure, an RDD can recover the loss by itself. As with fault tolerance, this relies on redundancy, since lost data is ultimately rebuilt from redundant data. While working with Spark we apply different transformations to RDDs, which creates a logical execution plan for all the tasks to be executed, also known as the lineage graph. If we lose an RDD partition because a machine fails, applying the same computation on another node recovers the same dataset.
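A complementary mechanism is RDD checkpointing, which saves an RDD to reliable storage and truncates its lineage, so recovery can restart from the checkpoint instead of replaying every step of a very long chain. A minimal sketch; the checkpoint directory and file path are placeholders:

```scala
// Inside spark-shell; `sc` is the SparkContext
sc.setCheckpointDir("hdfs:///checkpoints/my-app") // must be fault-tolerant storage

val base     = sc.textFile("hdfs:///data/events.txt")
val enriched = base.map(_.toLowerCase).filter(_.nonEmpty)

enriched.checkpoint() // truncate the lineage here; data is saved to the checkpoint dir
enriched.count()      // the action that actually materialises (and checkpoints) the RDD
```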

4. Consistent

Spark’s model enables exactly-once semantics and consistency, meaning the system gives correct results despite slow nodes or failures.

5. Scalable

Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is in-memory computation.

The platform allows user programs to load data into memory and query it repeatedly, making it a well-suited tool for online and interactive processing. It was motivated by the limitations of the Hadoop MapReduce paradigm, which forces a linear dataflow with intensive disk usage. Spark is based on a distributed data structure called the Resilient Distributed Dataset (RDD). Operations on RDDs automatically split work into tasks over partitions while maintaining the locality of persisted data. Beyond this, RDDs are an immutable and versatile tool that lets programmers persist intermediate results in memory or on disk for reuse, and customize the partitioning to optimize data placement.
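For example, an intermediate RDD can be persisted in memory (spilling to disk if needed) so that repeated queries reuse it rather than re-reading the source; the dataset and columns below are invented:

```scala
import org.apache.spark.storage.StorageLevel

// Inside spark-shell: persist an intermediate result so repeated queries reuse it
val ratings = sc.textFile("hdfs:///data/ratings.csv")
  .map(_.split(","))
  .filter(_.length == 3)

ratings.persist(StorageLevel.MEMORY_AND_DISK) // or ratings.cache() for memory only

// Both actions below hit the cached data instead of re-reading from disk
println(ratings.count())
println(ratings.map(_(2).toDouble).sum())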

In addition, the Machine Learning Library (MLlib) provides common learning algorithms and statistical utilities, and has been specially designed to simplify ML pipelines in large-scale environments.

6. Secure

Security in Spark is OFF by default, which means a default installation is vulnerable to attack. Spark supports multiple deployment types, and each one supports a different level of security.

There are many different kinds of security concern; these are some of the features Spark supports.

Spark RPC (Communication protocol between Spark processes)

Authentication — Spark currently supports authentication for RPC channels using a shared secret.

The exact mechanism used to generate and distribute the shared secret is deployment-specific.

Each application will use a unique shared secret.

Encryption — Spark supports AES-based encryption for RPC connections. For encryption to be enabled, RPC authentication must also be enabled and properly configured.

Local Storage Encryption

Spark supports encrypting temporary data written to local disks. This covers shuffle files, shuffle spills and data blocks stored on disk. It does not cover output data that applications generate through APIs such as saveAsHadoopFile or saveAsTable.
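Put together, these features are switched on through configuration. A sketch of the relevant settings via SparkConf; the secret value is a placeholder, and on some deployments (for example YARN) the shared secret is generated and distributed automatically:

```scala
import org.apache.spark.SparkConf

// Turning on the security features described above
val conf = new SparkConf()
  .setAppName("SecuredApp")
  .set("spark.authenticate", "true")             // RPC authentication via shared secret
  .set("spark.authenticate.secret", "change-me") // placeholder; used for standalone/Mesos
  .set("spark.network.crypto.enabled", "true")   // AES-based RPC encryption
  .set("spark.io.encryption.enabled", "true")    // encrypt shuffle files / spills on local disk
```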
