Book Info

Name: Learning Spark

Authors: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

Publisher: O’Reilly Media

Release Date: 2015

Topic: Learning Apache Spark 1.x


Preface

Spark lets you use a single unified engine instead of mixing and matching separate tools such as Hive, Hadoop MapReduce, Mahout, and Storm.

The Spark stack

Apache Spark provides multiple components that cover a wide range of workloads:

  • Spark SQL competes with Hive for interactive queries
  • MLlib competes with Mahout for machine learning
  • Spark Streaming competes with Storm for stream processing
  • GraphX competes with Neo4J for graph processing

Apache Spark offers three main benefits: it is easy to use, it is fast, and it is a general engine that supports a wide range of workloads.

Supporting books

If you don’t have any experience with Python, the books Learning Python and Head First Python are excellent introductions. If you have some experience with Python, Dive into Python is a great book for gaining a deeper understanding of the language. If you are an engineer who wants to expand your data analysis skills, Machine Learning for Hackers and Doing Data Science are excellent books.

Chapter 1: Introduction to Data Analysis with Spark

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

A Unified Stack

A philosophy of tight integration has several benefits:

  1. All libraries and higher-level components in the stack benefit from improvements at the lower layers.
  2. The costs associated with running the stack are minimized.
  3. It enables building applications that seamlessly combine different processing models.

Spark Core

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.

Spark SQL

Spark SQL is Spark’s package for working with structured data. It supports many sources of data, including Hive tables, Parquet, and JSON.
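
As a minimal sketch (assuming an existing SparkContext named sc, and people.json as a hypothetical JSON file with name and age fields), loading and querying JSON with the Spark 1.3+ SQLContext could look like this:

    # Assumes sc is an existing SparkContext and people.json is a
    # hypothetical JSON file with name/age fields.
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("people.json")   # load structured data
    df.registerTempTable("people")             # expose it to SQL queries
    teens = sqlContext.sql(
        "SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teens.show()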

Spark Streaming

Spark Streaming is a Spark component that enables processing of live streams of data.
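
For example, here is a minimal sketch (Spark 1.x API) of counting words arriving over a TCP socket. It assumes an existing SparkContext sc; localhost:9999 is a placeholder source (e.g., started with nc -lk 9999):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)      # 1-second batches
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                  # print each batch
    ssc.start()
    ssc.awaitTermination()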

MLlib

MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import, and lower-level ML primitives such as a generic gradient descent optimization algorithm.
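
As a minimal sketch of the RDD-based MLlib API (assuming an existing SparkContext sc; the points below are toy data), k-means clustering looks like this:

    from pyspark.mllib.clustering import KMeans

    # Toy 2-D points forming two obvious clusters.
    points = sc.parallelize([
        [0.0, 0.0], [1.0, 1.0],
        [9.0, 8.0], [8.0, 9.0],
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.predict([0.5, 0.5]))   # cluster id for a new point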

GraphX

GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.

Data Scientist

A Data Scientist is somebody whose main task is to analyze and model data. Sometimes, after the initial exploration phase, the work of a data scientist is “productized”: extended, hardened (i.e., made fault-tolerant), and tuned to become a production data processing application, which itself is a component of a business application.

Drawback of MapReduce Model

MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.
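
A minimal sketch of the in-memory idea (assuming an existing SparkContext sc and a hypothetical input file data.txt): cache an RDD once, then reuse it across iterations instead of rereading it from disk on every pass.

    # Parse once, keep the result in memory for repeated passes.
    data = sc.textFile("data.txt").map(float)
    data.cache()
    for i in range(10):
        # Each pass over the data hits the in-memory copy.
        total = data.map(lambda x: x * i).sum()
        print(i, total)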

Storage Layers for Spark

Spark does not require Hadoop; it simply supports storage systems that implement the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
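
For instance (a sketch with placeholder paths, assuming an existing SparkContext sc), the same API reads local files, HDFS paths, or SequenceFiles through the Hadoop input layer:

    # Placeholder paths; any Hadoop-compatible URI works here.
    local_lines = sc.textFile("file:///tmp/input.txt")
    hdfs_lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
    pairs = sc.sequenceFile("hdfs://namenode:9000/data/pairs.seq")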

Chapter 2: Downloading Spark and Getting Started

To exit either the spark-shell or the pyspark shell, press Ctrl-D.
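
Outside the shells (which create a SparkContext named sc for you), a standalone application creates its own context. A minimal sketch:

    from pyspark import SparkConf, SparkContext

    # Run locally with an identifying application name.
    conf = SparkConf().setMaster("local").setAppName("My App")
    sc = SparkContext(conf=conf)
    print(sc.parallelize([1, 2, 3]).count())   # quick sanity check
    sc.stop()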

Components for distributed execution in Spark

In distributed mode, a Spark application consists of a driver program (which hosts the SparkContext), a cluster manager, and executor processes running on worker nodes.

Chapter 3: Programming with RDDs

An RDD is simply an immutable distributed collection of elements. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
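
A minimal sketch of all three kinds of work (assuming an existing SparkContext sc, as the shells provide):

    lines = sc.parallelize(["hello spark", "hello world"])   # create
    words = lines.flatMap(lambda line: line.split(" "))      # transform
    counts = words.countByValue()                            # action
    print(dict(counts))   # e.g., {'hello': 2, 'spark': 1, 'world': 1}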