What is a Data Lakehouse?
History of Data Warehouse
Pros
- Business Intelligence (BI)
- Analytics
- Structured & Clean Data
- Predefined Schemas
Cons
- No support for semi-structured or unstructured data
- Inflexible schemas
- Struggles with increases in data volume and velocity
- Long processing time
History of Data Lake
Pros
- Flexible data storage
- Structured, semi-structured, and unstructured data
- Streaming support
- Cost efficient in the cloud
- Support for AI and Machine Learning
Cons
- No transactional support
- Poor data reliability
- Data lakes do not support transactional workloads and cannot enforce data quality
- Primarily because they hold many different data types
- Slow analysis performance
- Because of the large volume of data, analysis performance is slower
- As a result, timely decision-making is hard to achieve
- Data governance concerns
- Governing data in a data lake is challenging; security and privacy are hard to enforce because of the unstructured nature of its contents
- Data Warehouse still needed
Problems with Complex Data Environment
- Data lakes did not fully replace data warehouses for reliable BI insights, so businesses implemented complex environments combining a data lake, a data warehouse, and additional systems to handle streaming data, machine learning, and artificial intelligence requirements.
- Such environments introduced complexity and delay, as data teams were stuck in silos completing disjointed work.
- Data had to be copied between the systems, and in some cases copied back, impacting oversight and data usage governance
- Not to mention the cost of storing the same information twice across disjointed systems
Data Lakehouse
- an open architecture that combines the flexible storage of a data lake with the analytical power and controls of a data warehouse
- built on a data lake, a data lakehouse can store all data of any type together
- it becomes a single, reliable source of truth, providing direct access for both AI and BI
Key features
- Transaction support
- Schema enforcement and governance
- Data governance
- BI Support
- Decoupled storage from compute
- Open storage formats
- Support for diverse data types
- Support for diverse workloads
- data science
- machine learning
- SQL analytics
- end-to-end streaming
The data lakehouse is essentially a modernised version of the data warehouse, providing all of its benefits and features without compromising the flexibility and depth of a data lake.
Databricks Architecture Overview
Databricks operates out of a control plane and a data plane.
- The Control Plane includes the backend services that Databricks manages in its own AWS account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
- The Data Plane is where your data is processed.
- For most Databricks computation, the compute resources are in your AWS account, in what is called the Classic data plane. This is the type of data plane Databricks uses for notebooks, jobs, and for pro and classic Databricks SQL warehouses.
- If you enable Serverless compute for Databricks SQL, the compute resources for Databricks SQL are in a shared Serverless data plane. The compute resources for notebooks, jobs, and pro and classic Databricks SQL warehouses still live in the Classic data plane in the customer account.
Lakehouse Platform Architecture
Problems with Data Lakes
- Lack of ACID transaction support
- Lack of schema enforcement
- Lack of integration with a data catalog
- Too many small files
Delta Lake
- ACID transaction guarantees
- Scalable data and metadata handling
- Audit history and time travel
- Schema enforcement and schema evolution
- Support for deletes, updates, and merges
- Unified streaming and batch data processing
- Additional characteristics
- compatible with Apache Spark
- Uses Delta Tables
- Has a transaction log
- Open-source project
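To make these features concrete, here is a minimal PySpark sketch of common Delta Lake operations: an ACID write, an upsert with MERGE, and time travel through the transaction log. The table and column names (events, event_id, status) are illustrative, and the snippet assumes a Spark session with the delta-spark package available.

```python
# Minimal sketch of Delta Lake operations; table/column names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ACID write: create a Delta table (the schema is recorded in the transaction log).
initial = spark.createDataFrame([(1, "open"), (2, "open")], ["event_id", "status"])
initial.write.format("delta").mode("overwrite").saveAsTable("events")

# Upsert with MERGE: update matching rows and insert new ones in a single transaction.
changes = spark.createDataFrame([(2, "closed"), (3, "open")], ["event_id", "status"])
(DeltaTable.forName(spark, "events").alias("t")
    .merge(changes.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: query the table as it was at an earlier version.
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
```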
Photon
- The next-generation execution engine built to accelerate SQL and DataFrame workloads on the Lakehouse
Unified Governance and Security
Challenges to data and AI Governance
- Diversity of data and AI assets
- Using two disparate and incompatible data platforms
- Data Warehouse for BI, and Data Lake for AI
- Rise of multi-cloud adoption
- Fragmented tool usage for data governance
Databricks’ solutions to the above challenges
- Unity Catalog is a unified governance solution for all data assets.
- Delta Sharing, as an open solution to securely share live data to any computing platform
- Control Plane
- Data Plane
Unity Catalog
- Unity Catalog is a unified governance solution for all data assets.
- Modern lakehouse systems support fine-grained row-, column-, and view-level access control
- SQL
- query auditing
- attribute-based access control
- data versioning
- data quality
- Unity Catalog allows you to restrict access to certain rows and columns to users or groups authorised to query them, using attribute-based access control
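As a hedged illustration of how such controls can be expressed, the sketch below grants a group query access and uses a dynamic view with is_account_group_member() to mask a column and filter rows; the catalog, schema, table, and group names are assumptions, not part of the source notes.

```python
# Illustrative sketch only: GRANT plus a dynamic view for row/column-level control.
# Catalog/schema/table/group names (main, sales, orders, analysts, ...) are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let a group query a table registered in Unity Catalog.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Dynamic view: mask a sensitive column and filter rows based on group membership.
spark.sql("""
  CREATE OR REPLACE VIEW main.sales.orders_restricted AS
  SELECT
    order_id,
    region,
    CASE WHEN is_account_group_member('finance') THEN amount ELSE NULL END AS amount
  FROM main.sales.orders
  WHERE is_account_group_member('admins') OR region = 'EMEA'
""")
```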
Delta Sharing
- Existing data sharing technologies have several limitations
- traditional data sharing technologies do not scale well and often serve files offloaded to a server
- cloud object stores operate at the object level and are cloud-specific
- commercial data sharing offerings often share tables instead of files
- Open cross-platform sharing
- Allows you to share existing data in Delta Lake and Apache Parquet formats
- Native integration with PowerBI, Tableau, Spark, Pandas, and Java
- Share live data without copying it
- Centralised administration and governance
- auditing at the table, partition, and version level
- Marketplace for data products
- Privacy-safe data clean rooms
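As a rough sketch of the consumer side, the open-source delta-sharing Python connector can load a shared table straight into pandas; the profile path and the share/schema/table names below are placeholders.

```python
# Hedged sketch of reading a Delta Share with the open-source delta-sharing connector.
# The profile file and share/schema/table names are illustrative only.
import delta_sharing

# A profile file (issued by the data provider) holds the sharing endpoint and token.
profile = "/path/to/config.share"

# Tables are addressed as <profile>#<share>.<schema>.<table>.
table_url = profile + "#retail_share.sales.orders"

# Load the shared table directly into pandas; no copy of the provider's data is made.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```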
Control Plane
- Web Application
- Configurations
- Notebooks, Repos, DBSQL
- Cluster Manager
Data Plane
- Data Plane is where your data is processed
- Unless you choose to use Serverless compute, the compute resources in the data plane run inside the customer’s own cloud account.
- All the data stays where it is.
Security in Data Plane
- Databricks clusters are typically short-lived, often terminated after a job, and do not persist data after termination.
- Code is launched in an unprivileged container to maintain system stability.
User Identity and Access
- Table ACLs feature
- IAM instance profiles
- Securely stored access key
- The Secrets API
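A small, hedged example of the Secrets API inside a Databricks notebook, where dbutils and spark are predefined; the secret scope, key, and JDBC details are illustrative only.

```python
# Illustrative only: read a credential from a secret scope instead of hard-coding it.
# dbutils and spark are provided automatically inside Databricks notebooks.
password = dbutils.secrets.get(scope="jdbc", key="password")  # scope/key are made up

# Pass the secret to a connector; the value is redacted if accidentally printed.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")   # illustrative URL
      .option("dbtable", "public.orders")
      .option("user", "analyst")
      .option("password", password)
      .load())
```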
Serverless Compute
Compute Resource Challenges
- Cluster creation is complicated
- Environment startup is slow
- Business cloud account limitations and resource options
- Long running clusters
- Over provisioning of resources
- Higher resource costs
- High admin overhead
- Unproductive users
Serverless Data Plane
- Three layers of isolation
- The container hosting the runtime
- The virtual machine hosting the container
- The virtual network for the workspace
Lakehouse Data Management Terminology
- Delta Lake
- Catalog. A grouping of databases.
- Database or schema. A grouping of objects in a catalog. Databases contain tables, views, and functions.
- Table. A collection of rows and columns stored as data files in object storage.
- Managed table. A table whose metadata and data files are both managed by Databricks; dropping the table deletes the underlying data (see the sketch after this list).
- External table. A table whose data files live in a storage location you control; dropping the table does not delete the underlying data.
- View. A saved query, typically against one or more tables or data sources.
- Function. Saved logic that returns a scalar value or a set of rows.
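A minimal sketch of the managed vs. external distinction, assuming Unity Catalog three-level names and an illustrative storage path:

```python
# Hedged sketch: managed vs. external tables; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Managed table: Databricks manages metadata and files; DROP TABLE removes the data.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.trips_managed (
    trip_id BIGINT, distance_km DOUBLE
  )
""")

# External table: data stays at a path you control; DROP TABLE leaves the files in place.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.trips_external (
    trip_id BIGINT, distance_km DOUBLE
  )
  LOCATION 's3://my-bucket/tables/trips'   -- illustrative path
""")
```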
Unity Catalog
- Centralised governance for data and AI
- Built-in data search and discovery
- Performance and scale
- Automated lineage for all workloads
- Integrated with your existing tools
- Unity Catalog namespacing model:
catalog_name.database_name.table_name
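For example (the catalog, schema, and table names are assumed for illustration):

```python
# Hedged sketch of Unity Catalog's three-level namespace; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fully qualified reference: catalog.database(schema).table
spark.sql("SELECT * FROM main.sales.orders LIMIT 10").show()

# Or set defaults so short table names resolve within a catalog and schema.
spark.sql("USE CATALOG main")
spark.sql("USE SCHEMA sales")
spark.sql("SELECT count(*) FROM orders").show()
```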
Supported Workloads: Data Warehousing
- Best price / performance
- Built-in governance
- A rich ecosystem
- Break down silos
Supported Workloads: Data Engineering
- Data is a valuable business asset
Challenges to Data Engineering Support
- Complex data ingestion methods
- Support for data engineering principles
- CI/CD pipelines
- Separation between production and development environments
- Testing before deployment
- Use of parameterisation to deploy and manage environments
- Unit testing
- Documentation
- Third-party orchestration tools
- Pipeline and architecture performance tuning
- Inconsistencies between data warehouse and data lake providers
The answer is a unified data platform with managed data ingestion, schema detection, enforcement, and evolution, paired with declarative, auto-scaling data flows and a lakehouse-native orchestrator that supports all kinds of workflows.
Lakehouse Capabilities
- Easy data ingestion (see the Auto Loader sketch after this list)
- Auto Loader
- COPY INTO
- Automated ETL pipelines
- Data quality checks
- Batch and streaming tuning
- Automatic recovery
- Data pipeline observability
- Simplified operations
- Scheduling and orchestration
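Below is a hedged sketch of what easy ingestion can look like with Auto Loader, streaming newly arrived files into a Delta table; the cloud paths and table name are assumptions. COPY INTO is the SQL alternative for idempotent batch loads.

```python
# Hedged sketch: incremental file ingestion with Auto Loader (cloudFiles).
# Cloud paths and the target table name are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader discovers new files as they land and can infer and evolve the schema.
raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
       .load("s3://my-bucket/landing/events/"))

# Stream into a Delta table with checkpointing; availableNow processes the current
# backlog and then stops, which mimics a scheduled batch load.
(raw.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("main.bronze.events"))

# The SQL alternative for idempotent batch loads is COPY INTO, e.g.
# spark.sql("COPY INTO main.bronze.events FROM 's3://my-bucket/landing/events/' FILEFORMAT = JSON")
```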
Delta Live Tables
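A minimal sketch of a Delta Live Tables pipeline definition: two declarative tables, one ingesting raw files and one applying a data-quality expectation. The dlt module is only importable inside a DLT pipeline (where spark is predefined), and the paths and names below are illustrative.

```python
# Hedged sketch of a Delta Live Tables pipeline; runs only inside a DLT pipeline.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested with Auto Loader")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
            .load("s3://my-bucket/landing/events/"))

# Declarative data quality check: rows failing the expectation are dropped.
@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
def silver_events():
    return (dlt.read_stream("bronze_events")
            .select("event_id", col("ts").cast("timestamp").alias("ts")))
```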
Supported Workloads: Data Streaming
- Real-time Analysis. Analyse streaming data for instant insights and faster decisions (see the streaming sketch after this list).
- Real-time Machine Learning. Train models on the freshest data. Score in real-time.
- Real-time Applications. Embed automatic and real-time actions into business applications.
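A brief sketch of what real-time analysis can look like with Structured Streaming on the lakehouse, assuming a bronze events table with a timestamp column ts and a status column (names invented for illustration):

```python
# Hedged sketch: windowed real-time aggregation with Structured Streaming.
# Source table, columns, and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# Read the bronze Delta table as a stream and count events per 1-minute window.
events = spark.readStream.table("main.bronze.events")

per_minute = (events
              .withWatermark("ts", "5 minutes")
              .groupBy(window("ts", "1 minute"), "status")
              .count())

# Continuously append finalised windows into a table that dashboards can query.
(per_minute.writeStream
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/per_minute")
    .toTable("main.gold.events_per_minute"))
```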
Reasons to use Databricks
- Build streaming pipelines and applications faster
- Simplify operations with automation
- Unified governance for real-time and historical data
Supported Workloads: Data Science & Machine Learning
Challenges to Data Science & Machine Learning Support
- Siloed and disparate data systems
- Complex experimentation environments
- Getting models to production
- Multiple tools available
- Experiments are hard to track
- Reproducing results is difficult
- ML is hard to deploy
Capabilities
- Databricks Machine Learning Runtime
- MLflow
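To ground these, here is a small, hedged MLflow tracking example using scikit-learn; the dataset, parameters, and metric are illustrative, not from the course.

```python
# Hedged sketch: track an experiment run with MLflow (illustrative dataset/params).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    # Log parameters, a metric, and the model itself so the run is reproducible.
    mlflow.log_params(params)
    mlflow.log_metric("mse", mean_squared_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```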
Reference
- Databricks Lakehouse Fundamentals