Name: Learning Spark
Author: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
Publisher: O’Reilly Media
Release Date: 2015
Topic: Learning Apache Spark 1.x
Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm.
Apache Spark provides several components, each covering a workload that otherwise calls for a separate tool:
- Spark SQL competes with Hive for interactive queries
- MLlib competes with Mahout for machine learning
- Spark Streaming competes with Storm for stream processing
- GraphX competes with Neo4j for graph processing
Apache Spark offers three main benefits: 1) it is easy to use, 2) it is fast, and 3) it is a general engine.
If you don’t have any experience with Python, the books Learning Python and Head First Python are excellent introductions. If you have some experience with Python, Dive into Python is a great book for gaining a deeper understanding of the language. If you are an engineer and you want to expand your data analysis skills, Machine Learning for Hackers and Doing Data Science are excellent books.
Chapter 1: Introduction to Data Analysis with Spark
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
A Unified Stack
A philosophy of tight integration has several benefits: 1) All libraries and higher-level components in the stack benefit from improvements at the lower layers. 2) The costs associated with running the stack are minimized. 3) Applications can seamlessly combine different processing models.
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
Spark SQL is Spark’s package for working with structured data. It supports many sources of data, including Hive tables, Parquet, and JSON.
Spark Streaming is a Spark component that enables processing of live streams of data.
MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering. It also provides supporting functionality such as model evaluation and data import, and lower-level ML primitives such as a generic gradient descent optimization algorithm.
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.
A data scientist is someone whose main task is to analyze and model data. Sometimes, after the initial exploration phase, the work of a data scientist is “productized”: extended, hardened (i.e., made fault-tolerant), and tuned to become a production data processing application, which is itself a component of a business application.
Drawback of MapReduce Model
MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.
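One way to picture the point about iterative jobs is the plain-Python sketch below. It does not use Spark or Hadoop; `CountingSource`, `iterate_without_cache`, and `iterate_with_cache` are hypothetical names made up for this illustration. It assumes that each pass of an iterative algorithm must otherwise reload its input from a slow source, and it counts how many reloads are avoided by keeping the data in memory.

```python
# Toy illustration only: no Spark or Hadoop involved.
# CountingSource is a hypothetical stand-in for a slow storage system.

class CountingSource:
    def __init__(self, values):
        self._values = list(values)
        self.reads = 0  # how many times the "storage" was read

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_without_cache(source, iterations):
    """Reload the input on every pass of the iterative job."""
    total = 0
    for _ in range(iterations):
        data = source.load()
        total += sum(data)
    return total


def iterate_with_cache(source, iterations):
    """Load the input once, keep it in memory, and reuse it on every pass."""
    data = source.load()
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total


uncached = CountingSource([1, 2, 3])
cached = CountingSource([1, 2, 3])

total_uncached = iterate_without_cache(uncached, 5)
total_cached = iterate_with_cache(cached, 5)

# Both strategies compute the same result; only the number of reads differs.
print(total_uncached, uncached.reads)
print(total_cached, cached.reads)
```

Both functions return the same total, but the cached variant touches the source once regardless of the iteration count, which is the kind of saving in-memory storage is meant to provide for iterative workloads.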
Storage Layers for Spark
Spark does not require Hadoop. It simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
Chapter 2: Downloading Spark and Getting Started
To exit either the spark-shell or the pyspark shell, press Ctrl-D.
Chapter 3: Programming with RDDs
An RDD is simply a distributed collection of elements. In Spark, all work is expressed as one of three things: creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
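The create, transform, and compute pattern can be sketched with a toy, single-machine stand-in written in plain Python. `MiniRDD` is a hypothetical class invented for this illustration and is not part of Spark's API; its method names merely echo common RDD operations. It is neither distributed nor fault-tolerant. It only shows how transformations build new collections and how a separate call produces a result.

```python
# Toy, single-machine stand-in for the RDD workflow.
# MiniRDD is hypothetical and is NOT Spark's API.

class MiniRDD:
    def __init__(self, source):
        # source: zero-argument callable returning a fresh iterator,
        # so the pipeline can be re-run each time a result is requested.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        """Create a MiniRDD from an in-memory collection."""
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new MiniRDD and compute nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Operations that compute a result: run the pipeline now.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


# 1) Create a collection.
lines = MiniRDD.parallelize(
    ["spark is fast", "hadoop mapreduce", "spark is general"]
)

# 2) Transform it into new collections.
spark_lines = lines.filter(lambda s: "spark" in s)
lengths = spark_lines.map(len)

# 3) Call operations to compute results.
result_count = spark_lines.count()
result_lengths = lengths.collect()

print(result_count)    # 2
print(result_lengths)  # [13, 16]
```

The toy defers all work until `count` or `collect` is called, which mirrors the way Spark evaluates transformations lazily and only runs them when a result is requested.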