Read Note - Learning Spark

Book Info
Name: Learning Spark
Author: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
Publisher: O’Reilly Media
Release Date: 2015
Topic: Learning Apache Spark 1.x

Preface
Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm. Apache Spark provides multiple components, each covering a workload that used to need a separate tool:
- Spark SQL is the competitor of Hive for interactive queries
- MLlib is the competitor of Mahout for machine learning
- Spark Streaming is the competitor of Storm for streaming
- GraphX is the competitor of Neo4j for graph processing

Apache Spark offers three main benefits: 1) easy to use....

April 1, 2016 · 3 min · 565 words · Eric

Data Warehouse Design - Inmon vs Kimball  [draft]

Inmon Framework (Enterprise Data Warehouse)
A centralized data repository that

Kimball Framework (Dimensional Data Warehouse)

Reference
Data Warehouse Architecture Comparison: Kimball and Inmon
dbt: Kimball in the Context of Modern Data Warehouse

March 10, 2016 · 1 min · 32 words · Eric

Data Warehouse Concepts

Basic Concepts

Dimensional Modelling
Dimensional modelling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements:
- Deliver data that’s understandable to the business users
- Deliver fast query performance

Dimensional modelling always uses the concepts of facts and dimensions. Facts, or measures, are typically (but not always) numeric values that can be aggregated; dimensions are groups of hierarchies and descriptors that define the facts....
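
To make the fact/dimension split concrete, here is a minimal sketch in plain Python. The table contents and the names `product_dim`, `sales_fact`, and `total_amount_by` are hypothetical and only for illustration; a real warehouse would hold these as database tables and run the aggregation in SQL.

```python
from collections import defaultdict

# Hypothetical dimension table: product_key -> descriptive attributes.
product_dim = {
    1: {"name": "Espresso", "category": "Coffee"},
    2: {"name": "Latte", "category": "Coffee"},
    3: {"name": "Green Tea", "category": "Tea"},
}

# Hypothetical fact table: one row per sale, holding a key into the
# dimension plus numeric measures that can be aggregated.
sales_fact = [
    {"product_key": 1, "quantity": 2, "amount": 6.0},
    {"product_key": 2, "quantity": 1, "amount": 4.5},
    {"product_key": 3, "quantity": 3, "amount": 7.5},
    {"product_key": 1, "quantity": 1, "amount": 3.0},
]


def total_amount_by(attribute):
    """Sum the `amount` measure, grouped by one dimension attribute."""
    totals = defaultdict(float)
    for row in sales_fact:
        dim_row = product_dim[row["product_key"]]  # look up the descriptors
        totals[dim_row[attribute]] += row["amount"]
    return dict(totals)


totals_by_category = total_amount_by("category")
```

The measure (`amount`) lives only in the fact rows, while the attribute used for grouping (`category`) lives only in the dimension, which is the separation the paragraph above describes.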

January 16, 2016 · 5 min · 1045 words · Eric

Managing Your Python Environment with Pipenv  [draft]

General
Pipenv is a packaging tool for Python that solves some common problems associated with the typical workflow using pip, virtualenv, and requirements.txt. Pipenv is designed to resolve dependency management chaos created by requirements.txt.

Introduction
First, let’s install the pipenv package. Pipenv uses pip and virtualenv under the hood but simplifies their usage with a single command-line interface.

    pip install --user pipenv

Pipenv introduces two new files. Pipfile is to replace the old requirements....

December 3, 2015 · 4 min · 677 words · Eric

Type Hint with Python  [draft]

Bad Code

    def get_authors_names(posts):
        authors_names = []
        for post in posts:
            author = post["author"]
            author_name = author["name"]
            authors_names.append(author_name)
        return authors_names

Good Code

    from typing import List, TypedDict


    class Author(TypedDict):
        name: str
        email: str
        bio: str
        website: str


    class Post(TypedDict):
        title: str
        author: Author
        publication_date: str
        content: str


    def get_authors_names(posts: List[Post]) -> List[str]:
        # The annotations document the expected shape of each post.
        return [post["author"]["name"] for post in posts]
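
A short usage sketch of the typed version follows. The sample posts are made up for illustration, and the block repeats the definitions so it runs on its own (Python 3.8 or later, where `typing.TypedDict` is available).

```python
from typing import List, TypedDict


class Author(TypedDict):
    name: str
    email: str
    bio: str
    website: str


class Post(TypedDict):
    title: str
    author: Author
    publication_date: str
    content: str


def get_authors_names(posts: List[Post]) -> List[str]:
    return [post["author"]["name"] for post in posts]


# Hypothetical sample data, only to exercise the function.
sample_posts: List[Post] = [
    {
        "title": "First post",
        "author": {"name": "Ada", "email": "ada@example.com",
                   "bio": "", "website": ""},
        "publication_date": "2015-12-01",
        "content": "Hello",
    },
    {
        "title": "Second post",
        "author": {"name": "Grace", "email": "grace@example.com",
                   "bio": "", "website": ""},
        "publication_date": "2015-12-02",
        "content": "World",
    },
]

names = get_authors_names(sample_posts)
```

At runtime a `TypedDict` is an ordinary `dict`, so the annotations add no checks when the code runs. The benefit comes from a static type checker such as mypy, which can flag a misspelled key like `post["autor"]` before the code is executed.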

December 3, 2015 · 1 min · 81 words · Eric