Understanding Apache Spark Architecture

Suraj Jeswara
5 min read · Mar 6, 2022

In order to hone my data engineering skills, I have recently been doing a course on Databricks 🧱. It has helped me clear up a lot of concepts, and I intend to share some of my acquired knowledge with my readers. If you are interested in doing the course too, the link to it is in the Appendix section below.

We will try to understand the concept using the analogy of Cadbury Gems, one of the most popular and favorite colorful chocolate candies in India.

Suppose you have a packet of Gems and you like just the brown-colored candies. Can you segregate only the brown ones in 60 seconds? Of course you can. But what if you are handed 100 packets? Trouble, right? It will take an individual a fair amount of time to segregate only the brown candies from 100 packets. Let us now imagine you have a class of students who can help you with this task. So, what resources would you need to bring about this task of segregation?

  • A table: for sitting and doing the segregation
  • A place from which one can retrieve the packets of Gems
  • Students to process the packets 👨‍🎓
  • An orchestrator: to plan and organize
  • Packets of Gems, properly and evenly divided 🍬🍬🍬
  • A place to store all the remaining (non-brown) candies after completion

If you have followed so far, try to answer the question below; you may share your answer in the comments.

Assuming that one person can process one packet of Cadbury Gems in 120 seconds, that there are only 20 students available, and that you have 100 packets of Cadbury Gems to process, how long will it take to complete the job?
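(If you want to sanity-check your answer afterwards, a back-of-the-envelope sketch in Python, using only the numbers from the question, could look like this:)

```python
import math

packets = 100          # packets of Gems to process
students = 20          # students working in parallel
secs_per_packet = 120  # one student needs 120 s per packet

# Each round, all 20 students process one packet each in parallel,
# so the total time is the number of rounds times 120 seconds.
rounds = math.ceil(packets / students)
print(rounds * secs_per_packet, "seconds")
```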

We will use the illustration below to understand how this task can be executed.

Step 1: The orchestrator divides the packets of Cadbury Gems into a small number of parts (partitions).

🍬🍬🍬🍬

🍬🍬🍬🍬

🍬🍬🍬🍬

Step 2: Identify each student and packet, assigning each a number for identification.

#s1 -> 👨‍🎓

#s2 -> 🦸

#c1 -> 🍬

#c2 -> 🍬

Step 3: Assign each student a packet or bag of Gems, e.g., student #s1 gets packet #c1 and student #s2 gets packet #c2.

#s1 👨‍🎓 -> #c1 🍬

#s2 🦸 -> #c2 🍬

Step 4: Ask the students to start sorting.

Step 5: The speed of sorting will vary from student to student; some will be fast while others will be slow.

Step 6: Some students will finish their task and keep aside the remaining candies.

Step 7: Those students will be assigned new packets by the orchestrator while the rest of the students process their previous batch of gems.

Steps of execution: 1) The driver decides how to start. 2) It assigns IDs and allocates packets to students. 3) Students sort and keep the remaining candies aside.

Having understood the analogy, let's link the terms used in it to the technical jargon in Spark.

1. Instructor or Orchestrator -> Driver (is in charge and gives instructions only)

2. Table at which students are working -> Executor (the main application space/environment in which code runs; a JVM process)

3. Instructor, students, and their respective resources -> Cluster (comprises the driver and executors)

4. The student -> Slot/Thread/Core (the unit of parallelization)

5. The small bags of candies -> Partitions

6. The larger bag of candies -> Dataset

7. The request to eliminate all the brown candies and place the remaining candies on the table -> Job

8. Specific instructions assigned to a single student to process a specific bag of candies -> Task
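To connect this jargon to real knobs, here is a minimal, purely illustrative PySpark sketch of starting an application. The values (6 executors, 2 cores, 4g) are made-up numbers; locally these executor settings have no effect, and on Databricks the cluster configuration sets them for you.

```python
from pyspark.sql import SparkSession

# A sketch only: on a real cluster manager (YARN, Kubernetes, standalone)
# these configs request 6 executor JVMs ("tables") with 2 slots
# ("students") each; Databricks sets them via the cluster UI instead.
spark = (
    SparkSession.builder
    .master("local[*]")                        # replace with your cluster master
    .appName("gems-sorting")
    .config("spark.executor.instances", "6")   # six "tables" (executors)
    .config("spark.executor.cores", "2")       # two "students" per table
    .config("spark.executor.memory", "4g")     # memory per executor
    .getOrCreate()
)
```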

The dataset is divided into partitions, and each partition is assigned to a core/thread/slot as a single task. These cores/threads/slots reside on executors. For example, if each executor has 2 cores/threads/slots, then 2 cores * 6 executors can handle 12 tasks in total. There is also the concept of nodes, which are virtual machines; one node can have multiple executors sharing its resources, but in the case of Databricks a single node runs a single executor.
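A tiny local PySpark sketch makes the partition-to-task mapping concrete (assuming pyspark is installed; local[2] stands in for two slots, and 12 partitions is an arbitrary choice):

```python
from pyspark.sql import SparkSession

# local[2] gives us two slots on this machine, standing in for two students.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("partitions-demo")
    .getOrCreate()
)

# 100 "packets" spread across 12 partitions -> Spark schedules 12 tasks,
# running at most 2 at a time on our 2 slots.
rdd = spark.sparkContext.parallelize(range(100), numSlices=12)
print(rdd.getNumPartitions())  # -> 12
```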

When an instruction is submitted to the driver, it decides the approach of execution and divides the work into jobs. A job has one or more stages, and stages have tasks, which are single units of work against a single partition of data. To give an example, suppose the job is to find only the distinct colors across the whole set of 100 packets. After every student finds their local distinct set, a global distinct has to be calculated; that can only happen once one student collects the local distincts from all the students and eliminates the duplicated colors, giving us the global distinct. So in this case the job, stages, and tasks will be as follows (a small code sketch follows the breakdown):

Job: Find distinct colors of candies from 100 packets.

Stage 1: Find local distinct

Stage 2: Collect local distinct and find the global distinct

Task 1.1: Group same colored candies.

Task 1.2: Take only one of the same-colored candies.

Task 2.1: Collect local distinct and find the final distinct.
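In PySpark, the candy job above is essentially a distinct(). A minimal local sketch (with made-up colors): stage 1 de-duplicates within each partition, then the shuffled partial results are merged in stage 2, and collect() is the action that triggers the job.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("distinct-demo")
    .getOrCreate()
)

# 4 partitions of candy colors (made-up sample data).
colors = ["brown", "red", "green", "brown", "yellow", "red"] * 100
rdd = spark.sparkContext.parallelize(colors, numSlices=4)

# distinct() de-duplicates within each partition (stage 1), shuffles the
# partial results, and merges them (stage 2); collect() triggers the job.
print(sorted(rdd.distinct().collect()))  # -> ['brown', 'green', 'red', 'yellow']
```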

This is how parallelization is achieved in Databricks, and huge volumes of data can be processed effectively in relatively little time. Hope you liked it. Don’t forget to like and follow as you leave 😅

So if we now read the googled definition of Spark, it will make more sense to us.

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.

Follow me on LinkedIn :)

If my article was of any help, you may drop me a tip by clicking the Tip icon below :)

Appendix

