Basic Concepts
Dimensional Modelling
Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements:
- Deliver that that’s understandable to the business users
- Deliver fast query performance
Dimensional modelling always uses the concepts of facts and dimensions.
Facts, or measures are typically (but not always) numeric values that can be aggregated; Dimensions are groups of hierarchies and descriptors that define the facts.
For example, in a sale transaction dataset, sales amount is a fact; timestamp, product, region number, store number, etc. are elements of dimensions.
Normalisation vs Denormalisation
Normalised Data
A schema design to store non-redundant and consistent data.
- Data Integrity is maintained.
- Little to no redundant data.
- Many tables.
Denormalised Data
A schema that combines data so that accessing data (querying) is fast.
- Data Integrity is not maintained.
- Redundant data is common.
- Fewer tables.
Typical Data Warehouse Architecture
Staging / Integration Layer
- Create data in particular day
- Ensure all data comes from the same format
Data Warehouse / Data Vault
- All Data are having the same format
- Everything in 2NF
Data Marts
- Different department or team can have their dedicated Data Mart, with their own business logic applied.
- Provide information more quickly
- One user does data manipulation in his Data Mart will not affect other user in their Data Mart
ETL and Data Marts
- Extraction, Transformation and Loading
- Extraction. Get the Data
- Transformation. make it useful.
- Loading. Save it to the warehouse.
- Data Marts (Sub-sets of the Data Warehouse)
- Don’t mess with my data.
- Keep it simple for the user.
- Small problems are easier to solve
Schema Design
The dimensional model is built on a star-like schema or snowflake schema, with dimensions surrounding the fact table. To build the schema, the following design model is used:
- Choose the business process.
- Declare the grain
- Identify the dimensions
- Identify the fact
Step 1: Define business process
The basic in the design build on the actual business process which the data warehouse should cover. Therefore, the first step in the model is to describe the business process which the model builds on. To describe the business process, one you choose to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design guides like the Unified Modeling Language (UML).
Step 2: Declare the grain
The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be “An individual line item on a customer slip from a retail store”. To clarify what the grain means, you should pick the central process and describe it w
Step 3: Identify the dimensions
Step 4: Identify the fact
Multidimensional Schemas
- Star Schema
- Snowflake Schema
- Kimball Model
Star Schema
Star Schema is the simplest type of Data Warehouse schema. In the Star Schema, the center of the star can have one fact tables and multiple associated dimension tables. It is also known as Star Join Schema and it optimised for querying large dataset.
Not in 3NF (Big Data)
- The data should be de-normalised to 2NF
- This means you get data redundancy
- This means you need more storage
- But you can get at the data more quickly
- The purposes of a DW is to provide aggregate data which is in suitable format for decision making.