General
- Databricks Certified Data Engineer Associate: link
- Time allotted to complete the exam is 1.5 hours (90 minutes)
- Exam fee: $200 USD
- Number of questions: 45
- Passing score is at least 70% on the overall exam
- Code examples
- data manipulation code will be in SQL when possible
- Structured Streaming code will be in Python
- Runtime version is DBR 10.4 LTS
- Practice Exam: link
Expectations
- Databricks Lakehouse Platform (24%). Understand how to use the Databricks Lakehouse Platform and its tools, and the benefits of using them.
- ELT with Spark SQL and Python (29%). Build ETL pipelines using Apache Spark SQL and Python.
- Incremental Data Processing (22%). Incrementally process data with Structured Streaming and Delta Live Tables.
- Production Pipelines (16%). Build production pipelines for data engineering applications, and Databricks SQL queries and dashboards.
- Data Governance (9%). Understand and follow security best practices.
Out-of-scope
- Apache Spark Internals
- Databricks CLI
- Databricks REST API
- Change Data Capture
- Data modeling concepts
- Notebooks and Job permissions
- Personally Identifiable Information (PII)
- GDPR / CCPA
- Monitoring and logging production jobs
- Dependency management
- Testing
Databricks Lakehouse Platform (24%)
- Lakehouse: Lakehouse description, Lakehouse Benefits to Data Teams
- Data Lakehouse vs Data Warehouse
- Data Lakehouse vs Data Lake
- Data Quality Improvements
- Data Science and Engineering Workspace: Clusters, DBFS, Notebooks, Repos
- High level architecture
- Key components of a workspace deployment
- Core services in a Databricks workspace deployment
- Delta Lake: General Concepts, Table Management, Table Manipulation, Optimisations
- Organisational data problems resolved with Lakehouse
- Benefits to different roles in a data team
Databricks Lakehouse Concepts
- Lakehouse Concepts
- Data Lakehouse vs Data Warehouse
- Data Lakehouse vs Data Lake
- Data Quality Improvements
- Platform Architecture
- High level architecture and key components of a workspace deployment
- Core services in a Databricks workspace deployment
- Benefits to Data Teams
- Organisational data problems solved with the Lakehouse
- Benefits to different roles in a data team
Data Science and Engineering Workspace
- Clusters
- All-purpose clusters vs job clusters
- Cluster instances and pools
- Databricks File System (DBFS)
- Managing permissions on tables
- Role permissions and functions
- Data Explorer
- Notebooks
- Features and limitations
- Collaboration best practices
- Repos
- Supported features and Git operations
- Relevance in CI/CD workflows in Databricks
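As a hands-on complement to the workspace topics above, here is a minimal DBFS sketch; it assumes a Databricks notebook where `spark` and `dbutils` are predefined, and the paths are illustrative only.

```python
# List a sample dataset folder shipped with every workspace.
files = dbutils.fs.ls("/databricks-datasets")
for f in files[:5]:
    print(f.path, f.size)

# Create a scratch directory and write a small text file to DBFS (illustrative path).
dbutils.fs.mkdirs("/tmp/exam_prep")
dbutils.fs.put("/tmp/exam_prep/hello.txt", "hello lakehouse", True)  # True = overwrite

# Read the file back with Spark to confirm the round trip.
spark.read.text("/tmp/exam_prep/hello.txt").show()
```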
Delta Lake
- General Concepts
- ACID transactions on a data lake
- Features and benefits of Delta Lake
- Table Management & Manipulation
- Creating tables
- Managing files
- Writing to tables
- Dropping tables
- Optimisations
- Supported features and benefits
- Table utilities to manage files
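To make the table-management and optimisation topics concrete, a minimal sketch follows; the table and column names are invented, and it assumes a Databricks notebook with the `spark` session available.

```python
# Create a managed Delta table and insert a row (illustrative schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  DATE
    )
""")
spark.sql("INSERT INTO sales VALUES (1, 100, 19.99, DATE'2022-01-01')")

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Table utilities: inspect the transaction log and remove unreferenced files
# (VACUUM keeps files inside the default 7-day retention window).
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)
spark.sql("VACUUM sales")
```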
Self-assessment for Databricks Lakehouse Platform
Objectives | Options |
---|---|
Identify how the data lakehouse solves a common organisational data problem | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared |
Identify limitations in Databricks Notebooks version control functionality relative to Repos | Very under-prepared, Somewhat under-prepared, Prepared |
Identify why Z-ordering is beneficial to Delta Lake tables | Very under-prepared, Somewhat under-prepared, Prepared |
ELT with Spark SQL and Python (29%)
Build ETL pipelines using Apache Spark SQL and Python
- Relational entities (databases, tables, views)
- ELT
- creating tables
- writing data to tables
- transforming data
- UDFs
- Manipulating data with Spark SQL and Python
Relational Entities
Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
- Databases
- Create databases in specific locations
- Retrieve locations of existing databases
- Modify and delete databases
- Tables
- Managed vs external tables
- Create and drop managed and external tables
- Query and modify managed and external tables
- Views and CTEs
- Views vs Temporary views
- Views vs Delta Lake tables
- Creating views and CTEs
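A minimal Spark SQL DDL sketch covering the entities above; the database, table, and view names plus the LOCATION paths are illustrative.

```python
# Create a database in a specific location, then retrieve that location.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db LOCATION '/tmp/demo_db'")
spark.sql("DESCRIBE DATABASE demo_db").show(truncate=False)

# Managed table: DROP TABLE removes both metadata and data files.
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.events (id INT, ts TIMESTAMP)")

# External table: DROP TABLE removes only metadata; the files at LOCATION remain.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.events_ext (id INT, ts TIMESTAMP)
    LOCATION '/tmp/demo_db/events_ext'
""")

# Temporary view (session-scoped) and a CTE used inside a single query.
spark.sql("CREATE OR REPLACE TEMP VIEW recent_events AS SELECT * FROM demo_db.events")
spark.sql("""
    WITH counts AS (SELECT id, COUNT(*) AS n FROM recent_events GROUP BY id)
    SELECT * FROM counts WHERE n > 1
""").show()
```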
ELT: Extract & Load Data into Delta Lake
Use Spark SQL to extract, load, and transform data to support production workloads and analytics in the Lakehouse
- Creating tables
- External sources vs Delta Lake tables
- Methods to create tables and use cases
- Delta table configurations
- Different file formats and data sources
- Create Table as Select (CTAS) statements
- Writing Data to Tables
- Methods to write to tables and use cases
- Efficiency for different operations
- Resulting behaviours in target tables
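A hedged sketch of the main load patterns; the JSON landing path, the table names, and the `order_updates` staging view are assumptions for illustration.

```python
# CTAS: create a Delta table from a query over raw JSON files (schema inferred).
spark.sql("""
    CREATE OR REPLACE TABLE bronze_orders AS
    SELECT * FROM json.`/tmp/landing/orders/`
""")

# Append vs overwrite: INSERT INTO appends (and duplicates rows if rerun);
# INSERT OVERWRITE / CREATE OR REPLACE TABLE replace contents atomically
# while keeping the table history available for time travel.
spark.sql("INSERT INTO bronze_orders SELECT * FROM json.`/tmp/landing/orders/`")

# MERGE INTO upserts from a staging view -- the usual idempotent pattern.
spark.sql("""
    MERGE INTO bronze_orders AS t
    USING order_updates AS s      -- assumed staging view of changed rows
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```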
ELT: Use Spark SQL to Transform Data
- Cleaning Data with Common SQL
- Methods to deduplicate data
- Common cleaning methods for different types of data
- Combining Data
- Join types and strategies
- Set operators and applications
- Reshaping Data
- Different operations to transform arrays
- Benefits of array functions
- Applying higher-order functions
- Advanced Operations
- Manipulating nested data fields
- Applying SQL UDFs for custom transformations
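The sketch below walks through these transformation topics on a tiny in-memory dataset; all names are invented.

```python
# A small view with duplicate rows and an array column.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW raw_events AS
    SELECT * FROM VALUES
        (1, 'a@example.com', array(10, 20, 20)),
        (1, 'a@example.com', array(10, 20, 20)),
        (2, 'b@example.com', array(5))
    AS t(user_id, email, scores)
""")

# Deduplicate whole rows, then explode the array into one row per element.
spark.sql("""
    SELECT user_id, email, explode(scores) AS score
    FROM (SELECT DISTINCT * FROM raw_events)
""").show()

# Higher-order functions transform arrays without exploding them.
spark.sql("""
    SELECT user_id,
           transform(scores, s -> s * 2) AS doubled,
           filter(scores, s -> s >= 10)  AS high_scores
    FROM raw_events
""").show(truncate=False)

# A SQL UDF wraps a reusable expression, e.g. a simple email mask.
spark.sql("""
    CREATE OR REPLACE FUNCTION mask_email(email STRING)
    RETURNS STRING
    RETURN concat(left(email, 1), '***@', substring_index(email, '@', -1))
""")
spark.sql("SELECT DISTINCT mask_email(email) FROM raw_events").show()
```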
Just Enough Python
Leverage PySpark for advanced code functionality needed in production applications
- Spark SQL Queries
- Executing Spark SQL Queries in PySpark
- Passing Data to SQL
- Temporary views
- Converting tables to and from DataFrames
- Python Syntax
- Functions and variables
- Control flow
- Error handling
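A short sketch of moving between SQL and Python; it assumes the `sales` table from the earlier Delta Lake sketch exists.

```python
# Interpolate a Python variable into a SQL string; spark.sql returns a DataFrame.
min_amount = 10
df = spark.sql(f"SELECT * FROM sales WHERE amount >= {min_amount}")

# DataFrame -> SQL via a temp view; SQL table -> DataFrame via spark.table.
df.createOrReplaceTempView("big_sales")
sales_df = spark.table("sales")

# Plain Python: a function, control flow, and error handling around Spark calls.
def row_count(table_name: str) -> int:
    """Return the row count of a table, or -1 if it cannot be read."""
    try:
        return spark.table(table_name).count()
    except Exception:
        return -1

for t in ["sales", "table_that_does_not_exist"]:
    n = row_count(t)
    print(f"{t}: {'missing' if n < 0 else n}")
```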
Self-assessment for ELT with Spark SQL and Python
Objectives | Options |
---|---|
Load and parse nested data from JSON files into a Delta table | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared |
Explode and flatten arrays in a dataset | Very under-prepared, Somewhat under-prepared, Prepared |
Apply a SQL UDF to complete a custom task | Very under-prepared, Somewhat under-prepared, Prepared |
Incremental Data Processing (22%)
- Structured Streaming (general concepts, triggers, watermarks)
- Auto Loader (streaming reads)
- Multi-hop Architecture (bronze-silver-gold, streaming applications)
- Delta Live Tables (benefits and features)
Structured Streaming
- General Concepts
- Programming model
- Configuration for reads and writes
- End-to-end fault tolerance
- Interacting with streaming queries
- Triggers
- Set up streaming writes with different trigger behaviours
- Watermarks
- Unsupported operations on streaming data
- Scenarios in which watermarking data would be necessary
- Managing state with watermarking
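A hedged end-to-end sketch of a streaming read and write; it assumes a Delta table `bronze_orders` with an `order_ts` timestamp column (names invented), and the checkpoint path is illustrative.

```python
from pyspark.sql.functions import window, col

# Stream from an existing Delta table (assumed to have an order_ts column).
stream_df = spark.readStream.table("bronze_orders")

# The watermark bounds how late data may arrive; state for windows older than
# the watermark can be dropped, which keeps aggregation state from growing forever.
counts = (stream_df
          .withWatermark("order_ts", "10 minutes")
          .groupBy(window(col("order_ts"), "5 minutes"))
          .count())

# The checkpoint gives end-to-end fault tolerance; the trigger controls batching.
# trigger(once=True) processes what is available and stops;
# trigger(processingTime="1 minute") would instead run a micro-batch every minute.
query = (counts.writeStream
               .outputMode("append")
               .option("checkpointLocation", "/tmp/checkpoints/order_counts")
               .trigger(once=True)
               .toTable("order_counts"))

query.awaitTermination()
```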
Auto Loader
Incrementally process data to power analytics insights with Spark Structured Streaming and Auto Loader
- Define streaming reads with Auto Loader and PySpark to load data into Delta
- Define streaming reads on tables for SQL manipulation
- Identifying source locations
- Use cases for using Auto Loader
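A minimal Auto Loader sketch; the landing path, schema location, and table names are illustrative assumptions.

```python
# Auto Loader ("cloudFiles") incrementally discovers new files in a landing path;
# schema inference state is kept at cloudFiles.schemaLocation across restarts.
raw_stream = (spark.readStream
                   .format("cloudFiles")
                   .option("cloudFiles.format", "json")
                   .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
                   .load("/tmp/landing/orders/"))

# Register a temp view so the incoming stream can be manipulated with SQL.
raw_stream.createOrReplaceTempView("orders_raw_stream")
cleaned = spark.sql("SELECT *, current_timestamp() AS ingest_ts FROM orders_raw_stream")

# Write into a bronze Delta table with a checkpoint for recovery.
(cleaned.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/orders_bronze")
        .trigger(once=True)
        .toTable("orders_bronze"))
```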
Multi-hop Architecture
Propagate new data through multiple tables in the data lakehouse
- Bronze
- Bronze vs raw tables
- Workloads using bronze tables as source
- Silver & Gold
- Silver vs gold tables
- Workloads using silver tables as source
- Structured Streaming in Multi-hop
- Converting data from bronze to silver levels with validations
- Converting data from silver to gold levels with aggregations
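A hedged sketch of the bronze-to-silver and silver-to-gold hops, continuing from the assumed `orders_bronze` table above; the column names are invented.

```python
from pyspark.sql.functions import col, sum as spark_sum

# Bronze -> silver: validate and deduplicate while streaming between Delta tables.
# (Streaming dropDuplicates without a watermark keeps unbounded state; acceptable for a sketch.)
(spark.readStream.table("orders_bronze")
      .filter(col("order_id").isNotNull())
      .dropDuplicates(["order_id"])
      .writeStream
      .option("checkpointLocation", "/tmp/checkpoints/orders_silver")
      .trigger(once=True)
      .toTable("orders_silver"))

# Silver -> gold: business-level aggregation ready for reporting.
(spark.readStream.table("orders_silver")
      .groupBy("customer_id")
      .agg(spark_sum("amount").alias("total_spend"))
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/checkpoints/customer_spend")
      .trigger(once=True)
      .toTable("customer_spend_gold"))
```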
Delta Live Tables
- General Concepts
- Benefits of using Delta Live Tables for ETL
- Scenarios that benefit from Delta Live Tables
- UI
- Deploying DLT pipelines from notebooks
- Executing updates
- Explore and evaluate results from DLT pipelines
- SQL Syntax
- Converting SQL definitions to Auto Loader syntax
- Common differences in DLT SQL syntax
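A minimal DLT pipeline sketch in Python; DLT code only runs when deployed as a pipeline (not interactively), and the dataset names and landing path are illustrative.

```python
# Runs only inside a Delta Live Tables pipeline, not as a normal notebook.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested with Auto Loader")
def orders_bronze_dlt():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/tmp/landing/orders/"))

@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # quality expectation
def orders_silver_dlt():
    return dlt.read_stream("orders_bronze_dlt").select(col("order_id"), col("amount"))
```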
Self-assessment for Incremental Data Processing
Objectives | Options |
---|---|
Set up a Structured Streaming write with specified configurations | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the use of bronze, silver, or gold tables for different workloads | Very under-prepared, Somewhat under-prepared, Prepared |
Deploy a DLT pipeline from an existing notebook | Very under-prepared, Somewhat under-prepared, Prepared |
Production Pipelines (16%)
Building production pipelines for data engineering applications and Databricks SQL queries and dashboards, including
- Workflows (Job scheduling, task orchestration, UI)
- Dashboards (endpoints, scheduling, alerting, refreshing)
Workflows
- Automation
- Setting up retry policies
- Using cluster pools and why
- Task Orchestration
- Benefits of using multiple tasks in a job
- Configuring predecessor tasks
- UI
- Using notebook parameters in jobs
- Locating job failures using Jobs UI
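For the notebook-parameter topic, a minimal sketch of what the task notebook itself looks like; the widget name and default value are illustrative, the job's task configuration passes the actual value, and retries, clusters, and pools are configured on the task in the Jobs UI.

```python
# Inside the task notebook: widgets receive the parameters the job passes in.
dbutils.widgets.text("run_date", "2022-01-01")   # name/default are illustrative
run_date = dbutils.widgets.get("run_date")

print(f"Processing data for {run_date}")
```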
Dashboards
- Databricks SQL Endpoints
- Creating SQL endpoints for different use cases
- Query Scheduling
- Scheduling queries based on scenarios
- Rerunning queries at set time intervals
- Alerting
- Configure notifications for different conditions
- Configure and manage alerts for failure
- Refreshing
- Scheduling dashboard refreshes
- Impact of query reruns on dashboard performance
Self-assessment for Production Pipelines
Objectives | Options |
---|---|
Configure an alert in case of values not meeting a condition | Very under-prepared, Somewhat under-prepared, Prepared |
Describe causes for slow dashboard performance | Very under-prepared, Somewhat under-prepared, Prepared |
Data Governance (9%)
- Unity Catalog (benefits and features)
- Entity Permissions (team-based permissions, user-based permissions)
Unity Catalog
- Benefits of Unity Catalog
- Unity Catalog Features
Entity Permissions
- Configuring access to production tables and database
- Granting different levels of permissions to users and groups
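A short sketch of granting table permissions with SQL; the group, database, and table names are invented.

```python
# Grant, inspect, and revoke permissions on a production table.
spark.sql("GRANT SELECT ON TABLE prod_db.sales TO `analysts`")
spark.sql("GRANT ALL PRIVILEGES ON TABLE prod_db.sales TO `data_engineers`")

spark.sql("SHOW GRANTS ON TABLE prod_db.sales").show(truncate=False)
spark.sql("REVOKE SELECT ON TABLE prod_db.sales FROM `analysts`")
```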
Self-assessment for Data Governance
Objectives | Options |
---|---|
Describe how Unity Catalog handles security | Very under-prepared, Somewhat under-prepared, Prepared |
Assign full access to a production table for a specific group | Very under-prepared, Somewhat under-prepared, Prepared |