Data Warehouse Concepts

Basic Concepts: Dimensional Modelling. Dimensional modelling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements: deliver data that’s understandable to the business users, and deliver fast query performance. Dimensional modelling always uses the concepts of facts and dimensions. Facts, or measures, are typically (but not always) numeric values that can be aggregated; dimensions are groups of hierarchies and descriptors that define the facts....
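The fact/dimension split can be illustrated with a minimal sketch in plain Python. The table contents, the names `product_dim` and `sales_facts`, and the helper `total_by_category` are all hypothetical and chosen only for illustration; a real warehouse would hold these as database tables and do the aggregation in SQL.

```python
from collections import defaultdict

# Dimension table (hypothetical data): descriptors keyed by a surrogate key.
product_dim = {
    1: {"name": "Espresso", "category": "Coffee"},
    2: {"name": "Latte", "category": "Coffee"},
    3: {"name": "Green Tea", "category": "Tea"},
}

# Fact table (hypothetical data): each row carries a key into the
# dimension plus a numeric measure that can be aggregated.
sales_facts = [
    {"product_key": 1, "amount": 3.0},
    {"product_key": 2, "amount": 4.5},
    {"product_key": 3, "amount": 2.5},
    {"product_key": 1, "amount": 3.0},
]


def total_by_category(facts, dim):
    """Sum the 'amount' measure, grouped by the dimension's category."""
    totals = defaultdict(float)
    for row in facts:
        category = dim[row["product_key"]]["category"]
        totals[category] += row["amount"]
    return dict(totals)


totals = total_by_category(sales_facts, product_dim)
print(totals)
```

The measure lives only in the fact rows, while the grouping attribute comes from the dimension, which is why the same facts can be re-aggregated along any descriptor the dimension provides.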

January 16, 2016 · 5 min · 1045 words · Eric

Managing Your Python Environment with Pipenv  [draft]

General: Pipenv is a packaging tool for Python that solves some common problems associated with the typical workflow using pip, virtualenv, and requirements.txt. It is designed to resolve the dependency management chaos created by requirements.txt. Introduction: First, let’s install the pipenv package. Pipenv uses pip and virtualenv under the hood but simplifies their usage with a single command-line interface.

    pip install --user pipenv

Pipenv introduces two new files. Pipfile is to replace the old requirements....

December 3, 2015 · 4 min · 677 words · Eric

Type Hint with Python  [draft]

Bad code

    def get_authors_names(posts):
        authors_names = []
        for post in posts:
            author = post["author"]
            author_name = author["name"]
            authors_names.append(author_name)
        return authors_names

Good code

    from typing import List, TypedDict

    class Author(TypedDict):
        name: str
        email: str
        bio: str
        website: str

    class Post(TypedDict):
        title: str
        author: Author
        publication_date: str
        content: str

    def get_authors_names(posts: List[Post]) -> List[str]:
        # The annotations document the expected shape of each post.
        return [post["author"]["name"] for post in posts]
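As a rough usage sketch of the typed version, the snippet below repeats the definitions so it runs on its own (Python 3.8+ is assumed for `typing.TypedDict`). The sample posts and their field values are made up for illustration and do not come from the original post.

```python
from typing import List, TypedDict


class Author(TypedDict):
    name: str
    email: str
    bio: str
    website: str


class Post(TypedDict):
    title: str
    author: Author
    publication_date: str
    content: str


def get_authors_names(posts: List[Post]) -> List[str]:
    return [post["author"]["name"] for post in posts]


# Hypothetical sample data; a TypedDict value is an ordinary dict at runtime.
sample_posts: List[Post] = [
    {
        "title": "First post",
        "author": {
            "name": "Ada",
            "email": "ada@example.com",
            "bio": "Writes about numbers.",
            "website": "https://example.com/ada",
        },
        "publication_date": "2015-12-03",
        "content": "Hello",
    },
    {
        "title": "Second post",
        "author": {
            "name": "Grace",
            "email": "grace@example.com",
            "bio": "Writes about compilers.",
            "website": "https://example.com/grace",
        },
        "publication_date": "2015-12-04",
        "content": "World",
    },
]

names = get_authors_names(sample_posts)
print(names)
```

The annotations are not enforced at runtime; the benefit comes from a static checker such as mypy, which can flag a misspelled key like `post["auther"]` before the code runs.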

December 3, 2015 · 1 min · 81 words · Eric

Hive Query Performance Tuning

There are several parameters that we can tune in Hive to improve the overall query performance. For instance,

    -- refers to http://hortonworks.com/community/forums/topic/mapjoinmemoryexhaustionexception-on-local-job/
    -- before running your query to disable local in-memory joins and force the join to be done as a distributed Map-Reduce phase.
    -- After running your query you should set the value back to true with: set hive....

May 11, 2015 · 1 min · 108 words · Eric

Python Libraries

Stats & exploratory analysis: NumPy, Pandas, SciPy, Statsmodels, Patsy, Emcee.
Machine Learning: Scikit-learn, TensorFlow, Keras, Theano, Lifelines.
Data Visualization: Matplotlib, Plotly, Seaborn, Geomap, Folium, NetworkX, Basemap.
Web Scraping: BeautifulSoup, Scrapy, Requests.
Natural Language Processing: NLTK, Gensim, TextBlob.
Utility: when-changed, tenacity.

March 6, 2015 · 1 min · 41 words · Eric