General
- Databricks Certified Data Engineer Associate: link
- Time allotted to complete the exam is 1.5 hours (90 minutes)
- Exam fee: $200 USD
- Number of questions: 45
- Passing score is at least 70% on the overall exam
- Code examples
- data manipulation code will be in SQL when possible
- Structured Streaming code will be in Python
- Runtime version is DBR 10.4 LTS
- Practice Exam: link
Expectations
- Databricks Lakehouse Platform (24%). Understand how to use the Databricks Lakehouse Platform and its tools, and the benefits of using them.
- ELT with Spark SQL and Python (29%). Build ETL pipelines using Apache Spark SQL and Python.
- Incremental Data Processing (22%). Incrementally process data with Structured Streaming and Delta Live Tables.
- Production Pipelines (16%). Build production pipelines for data engineering applications, and Databricks SQL queries and dashboards.
- Data Governance (9%). Understand and follow security best practices.
Out-of-scope
- Apache Spark Internals
- Databricks CLI
- Databricks REST API
- Change Data Capture
- Data modeling concepts
- Notebooks and Job permissions
- Personally Identifiable Information (PII)
- GDPR / CCPA
- Monitoring and logging production jobs
- Dependency management
- Testing
Databricks Lakehouse Platform (24%)
- Lakehouse: Lakehouse description, Lakehouse Benefits to Data Teams
- Data Lakehouse vs Data Warehouse
- Data Lakehouse vs Data Lake
- Data Quality Improvements
- Data Science and Engineering Workspace: Clusters, DBFS, Notebooks, Repos
- High level architecture
- Key components of a workspace deployment
- Core services in a Databricks workspace deployment
- Delta Lake: General Concepts, Table Management, Table Manipulation, Optimisations
- Organisational data problems resolved with Lakehouse
- Benefits to different roles in a data team
Databricks Lakehouse Concepts
- Lakehouse Concepts
- Data Lakehouse vs Data Warehouse
- Data Lakehouse vs Data Lake
- Data Quality Improvements
- Platform Architecture
- High level architecture and key components of a workspace deployment
- Core services in a Databricks workspace deployment
- Benefits to Data Teams
- Organisational data problems solved with the Lakehouse
- Benefits to different roles in a data team
Data Science and Engineering Workspace
- Clusters
- All-purpose clusters vs job clusters
- Cluster instances and pools
- Databricks File System (DBFS)
- Managing permissions on tables
- Role permissions and functions
- Data Explorer
- Notebooks
- Features and limitations
- Collaboration best practices
- Repos
- Supported features and Git operations
- Relevance in CI/CD workflows in Databricks
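As a hands-on complement to the workspace topics above, here is a minimal DBFS sketch; it assumes a Databricks notebook where `spark` and `dbutils` are predefined, and the paths are illustrative only.

```python
# List a sample dataset folder shipped with every workspace.
files = dbutils.fs.ls("/databricks-datasets")
for f in files[:5]:
    print(f.path, f.size)

# Create a scratch directory and write a small text file to DBFS (illustrative path).
dbutils.fs.mkdirs("/tmp/exam_prep")
dbutils.fs.put("/tmp/exam_prep/hello.txt", "hello lakehouse", True)  # True = overwrite

# Read the file back with Spark to confirm the round trip.
spark.read.text("/tmp/exam_prep/hello.txt").show()
```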
Delta Lake
- General Concepts
- ACID transactions on a data lake
- Features and benefits of Delta Lake
- Table Management & Manipulation
- Creating tables
- Managing files
- Writing to tables
- Dropping tables
- Optimisations
- Supported features and benefits
- Table utilities to manage files
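To make the table-management and optimisation topics concrete, a minimal sketch follows; the table and column names are invented, and it assumes a Databricks notebook with the `spark` session available.

```python
# Create a managed Delta table and insert a row (illustrative schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  DATE
    )
""")
spark.sql("INSERT INTO sales VALUES (1, 100, 19.99, DATE'2022-01-01')")

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Table utilities: inspect the transaction log and remove unreferenced files
# (VACUUM keeps files inside the default 7-day retention window).
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)
spark.sql("VACUUM sales")
```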
Self-assessment for Databricks Lakehouse Platform
Objectives | Options |
---|---|
Identify how the data lakehouse solves a common organisational data problem | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared |
Identify limitations in Databricks Notebooks version control functionality relative to Repos | Very under-prepared, Somewhat under-prepared, Prepared |
Identify why Z-ordering is beneficial to Delta Lake tables | Very under-prepared, Somewhat under-prepared, Prepared |
ELT with Spark SQL and Python (29%)
Build ETL pipelines using Apache Spark SQL and Python
- Relational entities (databases, tables, views)
- ELT
- creating tables
- writing data to tables
- transforming data
- UDFs
- Manipulating data with Spark SQL and Python
Relational Entities
Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
- Databases
- Create databases in specific locations
- Retrieve locations of existing databases
- Modify and delete databases
- Tables
- Managed vs external tables
- Create and drop managed and external tables
- Query and modify managed and external tables
- Views and CTEs
- Views vs Temporary views
- Views vs Delta Lake tables
- Creating views and CTEs
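A minimal Spark SQL DDL sketch covering the entities above; the database, table, and view names plus the LOCATION paths are illustrative.

```python
# Create a database in a specific location, then retrieve that location.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db LOCATION '/tmp/demo_db'")
spark.sql("DESCRIBE DATABASE demo_db").show(truncate=False)

# Managed table: DROP TABLE removes both metadata and data files.
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.events (id INT, ts TIMESTAMP)")

# External table: DROP TABLE removes only metadata; the files at LOCATION remain.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.events_ext (id INT, ts TIMESTAMP)
    LOCATION '/tmp/demo_db/events_ext'
""")

# Temporary view (session-scoped) and a CTE used inside a single query.
spark.sql("CREATE OR REPLACE TEMP VIEW recent_events AS SELECT * FROM demo_db.events")
spark.sql("""
    WITH counts AS (SELECT id, COUNT(*) AS n FROM recent_events GROUP BY id)
    SELECT * FROM counts WHERE n > 1
""").show()
```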
ELT: Extract & Load Data into Delta Lake
Use Spark SQL to extract, load, and transform data to support production workloads and analytics in the Lakehouse
- Creating tables
- External sources vs Delta Lake tables
- Methods to create tables and use cases
- Delta table configurations
- Different file formats and data sources
- Create Table as Select (CTAS) statements
- Writing Data to Tables
- Methods to write to tables and use cases
- Efficiency for different operations
- Resulting behaviours in target tables
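A hedged sketch of the main load patterns; the JSON landing path, the table names, and the `order_updates` staging view are assumptions for illustration.

```python
# CTAS: create a Delta table from a query over raw JSON files (schema inferred).
spark.sql("""
    CREATE OR REPLACE TABLE bronze_orders AS
    SELECT * FROM json.`/tmp/landing/orders/`
""")

# Append vs overwrite: INSERT INTO appends (and duplicates rows if rerun);
# INSERT OVERWRITE / CREATE OR REPLACE TABLE replace contents atomically
# while keeping the table history available for time travel.
spark.sql("INSERT INTO bronze_orders SELECT * FROM json.`/tmp/landing/orders/`")

# MERGE INTO upserts from a staging view -- the usual idempotent pattern.
spark.sql("""
    MERGE INTO bronze_orders AS t
    USING order_updates AS s      -- assumed staging view of changed rows
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```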
ELT: Use Spark SQL to Transform Data
- Cleaning Data with Common SQL
- Methods to deduplicate data
- Common cleaning methods for different types of data
- Combining Data
- Join types and strategies
- Set operators and applications
- Reshaping Data
- Different operations to transform arrays
- Benefits of array functions
- Applying higher-order functions
- Advanced Operations
- Manipulating nested data fields
- Applying SQL UDFs for custom transformations
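The sketch below walks through these transformation topics on a tiny in-memory dataset; all names are invented.

```python
# A small view with duplicate rows and an array column.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW raw_events AS
    SELECT * FROM VALUES
        (1, 'a@example.com', array(10, 20, 20)),
        (1, 'a@example.com', array(10, 20, 20)),
        (2, 'b@example.com', array(5))
    AS t(user_id, email, scores)
""")

# Deduplicate whole rows, then explode the array into one row per element.
spark.sql("""
    SELECT user_id, email, explode(scores) AS score
    FROM (SELECT DISTINCT * FROM raw_events)
""").show()

# Higher-order functions transform arrays without exploding them.
spark.sql("""
    SELECT user_id,
           transform(scores, s -> s * 2) AS doubled,
           filter(scores, s -> s >= 10)  AS high_scores
    FROM raw_events
""").show(truncate=False)

# A SQL UDF wraps a reusable expression, e.g. a simple email mask.
spark.sql("""
    CREATE OR REPLACE FUNCTION mask_email(email STRING)
    RETURNS STRING
    RETURN concat(left(email, 1), '***@', substring_index(email, '@', -1))
""")
spark.sql("SELECT DISTINCT mask_email(email) FROM raw_events").show()
```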
Just Enough Python
Leverage PySpark for advanced code functionality needed in production applications
- Spark SQL Queries
- Executing Spark SQL Queries in PySpark
- Passing Data to SQL
- Temporary views
- Converting tables to and from DataFrames
- Python Syntax
- Functions and variables
- Control flow
- Error handling
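A short sketch of moving between SQL and Python; it assumes the `sales` table from the earlier Delta Lake sketch exists.

```python
# Interpolate a Python variable into a SQL string; spark.sql returns a DataFrame.
min_amount = 10
df = spark.sql(f"SELECT * FROM sales WHERE amount >= {min_amount}")

# DataFrame -> SQL via a temp view; SQL table -> DataFrame via spark.table.
df.createOrReplaceTempView("big_sales")
sales_df = spark.table("sales")

# Plain Python: a function, control flow, and error handling around Spark calls.
def row_count(table_name: str) -> int:
    """Return the row count of a table, or -1 if it cannot be read."""
    try:
        return spark.table(table_name).count()
    except Exception:
        return -1

for t in ["sales", "table_that_does_not_exist"]:
    n = row_count(t)
    print(f"{t}: {'missing' if n < 0 else n}")
```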
Self-assessment for ELT with Spark SQL and Python
Objectives | Options |
---|---|
Load and parse nested data from JSON files into a Delta table | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared |
Explode and flatten arrays in a dataset | Very under-prepared, Somewhat under-prepared, Prepared |
Apply a SQL UDF to complete a custom task | Very under-prepared, Somewhat under-prepared, Prepared |
Incremental Data Processing (22%)
- Structured Streaming (general concepts, triggers, watermarks)
- Auto Loader (streaming reads)
- Multi-hop Architecture (bronze-silver-gold, streaming applications)
- Delta Live Tables (benefits and features)
Structured Streaming
- General Concepts
- Programming model
- Configuration for reads and writes
- End-to-end fault tolerance
- Interacting with streaming queries
- Triggers
- Set up streaming writes with different trigger behaviours
- Watermarks
- Unsupported operations on streaming data
- Scenarios in which watermarking data would be necessary
- Managing state with watermarking
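A hedged end-to-end sketch of a streaming read and write; it assumes a Delta table `bronze_orders` with an `order_ts` timestamp column (names invented), and the checkpoint path is illustrative.

```python
from pyspark.sql.functions import window, col

# Stream from an existing Delta table (assumed to have an order_ts column).
stream_df = spark.readStream.table("bronze_orders")

# The watermark bounds how late data may arrive; state for windows older than
# the watermark can be dropped, which keeps aggregation state from growing forever.
counts = (stream_df
          .withWatermark("order_ts", "10 minutes")
          .groupBy(window(col("order_ts"), "5 minutes"))
          .count())

# The checkpoint gives end-to-end fault tolerance; the trigger controls batching.
# trigger(once=True) processes what is available and stops;
# trigger(processingTime="1 minute") would instead run a micro-batch every minute.
query = (counts.writeStream
               .outputMode("append")
               .option("checkpointLocation", "/tmp/checkpoints/order_counts")
               .trigger(once=True)
               .toTable("order_counts"))

query.awaitTermination()
```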
Auto Loader
Incrementally process data to power analytics insights with Spark Structured Streaming and Auto Loader
- Define streaming reads with Auto Loader and PySpark to load data into Delta
- Define streaming reads on tables for SQL manipulation
- Identifying source locations
- Use cases for using Auto Loader
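A minimal Auto Loader sketch; the landing path, schema location, and table names are illustrative assumptions.

```python
# Auto Loader ("cloudFiles") incrementally discovers new files in a landing path;
# schema inference state is kept at cloudFiles.schemaLocation across restarts.
raw_stream = (spark.readStream
                   .format("cloudFiles")
                   .option("cloudFiles.format", "json")
                   .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
                   .load("/tmp/landing/orders/"))

# Register a temp view so the incoming stream can be manipulated with SQL.
raw_stream.createOrReplaceTempView("orders_raw_stream")
cleaned = spark.sql("SELECT *, current_timestamp() AS ingest_ts FROM orders_raw_stream")

# Write into a bronze Delta table with a checkpoint for recovery.
(cleaned.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/orders_bronze")
        .trigger(once=True)
        .toTable("orders_bronze"))
```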
Multi-hop Architecture
Propagate new data through multiple tables in the data lakehouse
- Bronze
- Bronze vs raw tables
- Workloads using bronze tables as source
- Silver & Gold
- Silver vs gold tables
- Workloads using silver tables as source
- Structured Streaming in Multi-hop
- Converting data from bronze to silver levels with validations
- Converting data from silver to gold levels with aggregations
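A hedged sketch of the bronze-to-silver and silver-to-gold hops, continuing from the assumed `orders_bronze` table above; the column names are invented.

```python
from pyspark.sql.functions import col, sum as spark_sum

# Bronze -> silver: validate and deduplicate while streaming between Delta tables.
# (Streaming dropDuplicates without a watermark keeps unbounded state; acceptable for a sketch.)
(spark.readStream.table("orders_bronze")
      .filter(col("order_id").isNotNull())
      .dropDuplicates(["order_id"])
      .writeStream
      .option("checkpointLocation", "/tmp/checkpoints/orders_silver")
      .trigger(once=True)
      .toTable("orders_silver"))

# Silver -> gold: business-level aggregation ready for reporting.
(spark.readStream.table("orders_silver")
      .groupBy("customer_id")
      .agg(spark_sum("amount").alias("total_spend"))
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/checkpoints/customer_spend")
      .trigger(once=True)
      .toTable("customer_spend_gold"))
```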
Delta Live Tables
- General Concepts
- Benefits of using Delta Live Tables for ETL
- Scenarios that benefit from Delta Live Tables
- UI
- Deploying DLT pipelines from notebooks
- Executing updates
- Explore and evaluate results from DLT pipelines
- SQL Syntax
- Converting SQL definitions to Auto Loader syntax
- Common differences in DLT SQL syntax
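A minimal DLT pipeline sketch in Python; DLT code only runs when deployed as a pipeline (not interactively), and the dataset names and landing path are illustrative.

```python
# Runs only inside a Delta Live Tables pipeline, not as a normal notebook.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested with Auto Loader")
def orders_bronze_dlt():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/tmp/landing/orders/"))

@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # quality expectation
def orders_silver_dlt():
    return dlt.read_stream("orders_bronze_dlt").select(col("order_id"), col("amount"))
```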
Self-assessment for Incremental Data Processing
Objectives | Options |
---|---|
Set up a Structured Streaming write with specified configurations | Very under-prepared, Somewhat under-prepared, Prepared |
Describe the use of bronze, silver, or gold tables for different workloads | Very under-prepared, Somewhat under-prepared, Prepared |
Deploy a DLT pipeline from an existing notebook | Very under-prepared, Somewhat under-prepared, Prepared |
Production Pipelines (16%)
Building production pipelines for data engineering applications and Databricks SQL queries and dashboards, including
- Workflows (Job scheduling, task orchestration, UI)
- Dashboards (endpoints, scheduling, alerting, refreshing)
Workflows
- Automation
- Setting up retry policies
- Using cluster pools and why
- Task Orchestration
- Benefits of using multiple tasks in a job
- Configuring predecessor tasks
- UI
- Using notebook parameters in jobs
- Locating job failures using Jobs UI
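For the notebook-parameter topic, a minimal sketch of what the task notebook itself looks like; the widget name and default value are illustrative, the job's task configuration passes the actual value, and retries, clusters, and pools are configured on the task in the Jobs UI.

```python
# Inside the task notebook: widgets receive the parameters the job passes in.
dbutils.widgets.text("run_date", "2022-01-01")   # name/default are illustrative
run_date = dbutils.widgets.get("run_date")

print(f"Processing data for {run_date}")
```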
Dashboards
- Databricks SQL Endpoints
- Creating SQL endpoints for different use cases
- Query Scheduling
- Scheduling queries based on scenarios
- Rerunning queries at set time intervals
- Alerting
- Configure notifications for different conditions
- Configure and manage alerts for failure
- Refreshing
- Scheduling dashboard refreshes
- Impact of query reruns on dashboard performance
Self-assessment for Production Pipelines
Objectives | Options |
---|---|
Configure an alert in case of values not meeting a condition | Very under-prepared, Somewhat under-prepared, Prepared |
Describe causes for slow dashboard performance | Very under-prepared, Somewhat under-prepared, Prepared |
Data Governance (9%)
- Unity Catalog (benefits and features)
- Entity Permissions (team-based permissions, user-based permissions)
Unity Catalog
- Benefits of Unity Catalog
- Unity Catalog Features
Entity Permissions
- Configuring access to production tables and database
- Granting different levels of permissions to users and groups
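A short sketch of granting table permissions with SQL; the group, database, and table names are invented.

```python
# Grant, inspect, and revoke permissions on a production table.
spark.sql("GRANT SELECT ON TABLE prod_db.sales TO `analysts`")
spark.sql("GRANT ALL PRIVILEGES ON TABLE prod_db.sales TO `data_engineers`")

spark.sql("SHOW GRANTS ON TABLE prod_db.sales").show(truncate=False)
spark.sql("REVOKE SELECT ON TABLE prod_db.sales FROM `analysts`")
```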
Self-assessment for Data Governance
Objectives | Options |
---|---|
Describe how Unity Catalog handles security | Very under-prepared, Somewhat under-prepared, Prepared |
Assign full access to a production table for a specific group | Very under-prepared, Somewhat under-prepared, Prepared |