General
- Databricks Certified Data Engineer Professional: link
- Time allotted to complete the exam is 2 hours (120 minutes)
- Exam fee $200 USD
- Number of questions 60
- Question type: multiple choice questions
- Passing score is at least 70% on the overall exam
- Code Examples
- Data manipulation code will be in SQL when possible
- Structured Streaming code will be in Python
- Runtime version is DBR 10.4 LTS
- Practice Exam: link
Target Audience
- Data Engineer, >= 2yoe
- Advanced practitioner certification
- Assesses candidates at a level equivalent to two or more years of data engineering experience with Databricks
Expectation
- Understanding of the Databricks platform and developer tools
- Ability to build optimised and cleaned data processing pipelines using the Spark and Delta Lake APIs
- Ability to model data into a Lakehouse using knowledge of general data modeling concepts
- Ability to make data pipelines secure, reliable, monitored, and tested before deployment
Out of Scope
The following is not expected of a Professional-level data engineer:
- Terraform (infrastructure as code)
- Cloud-specific security and integrations
- Automation servers (Jenkins, Azure DevOps, etc)
- Orchestration tools (Airflow, ADF, etc)
- Kafka (and specifications of other pub/sub systems)
- Gitflow
- Managed CI/CD
- Scala
- Delta Live Tables
Exam Topics
- Databricks Tooling (20%)
- Data Processing (30%)
- Data Modeling (20%)
- Security & Governance (10%)
- Monitoring & Logging (10%)
- Testing & Deployment (10%)
Databricks Tooling (20%, 12/60)
Understand how to use the Databricks platform and its tools, and the benefits of using them, including:
- Web App: notebooks, clusters, jobs, DBSQL, relational entities, Repos
- Databricks APIs: DBUtils, MLflow, magic commands
- Apache Spark, Delta Lake, Databricks CLI and REST API
Web App
- Notebooks
- Develop and manually execute Spark code.
- Edit notebooks to eliminate unnecessary compute in production workloads.
- Clusters
- Configure Spark settings, libraries, and security options for interactive development.
- Examine event logs, driver logs, metrics, Spark UI for potential execution problems.
- Jobs
- Schedule notebooks to run on ephemeral clusters.
- Use cron-based (scheduled) triggers.
- Orchestrate multiple tasks as a DAG.
- DBSQL
- Configure dashboards for visual reporting and monitoring.
- Configure email alerts when particular conditions are met.
- Relational Entities (Unity Catalog)
- Create databases, tables, and views in user-specified locations.
- Understand the differences between managed and external tables during table manipulation (e.g., dropping and renaming).
- Repos
- Create a new branch and commit changes to external Git provider.
- Pull changes from external Git provider into a workspace.
Databricks APIs
- DBUtils
- Mount external storage locations for reading and writing
- Interact with files stored in mounted containers
- MLflow
- Load and apply a model registered in MLflow as part of an ETL pipeline step
- Store predictions from an MLflow model in a Delta table (see the sketch after this list)
- Magic Commands
- Execute code from other languages in notebooks
- Run bash commands on the driver
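A minimal sketch of how these pieces fit into an ETL step, assuming a notebook context (where `spark` and `dbutils` are provided); the storage account, secret scope, paths, and the registered model name `churn_model` are all illustrative placeholders:

```python
import mlflow

# Mount an external container (placeholder storage account and secret scope).
dbutils.fs.mount(
    source="wasbs://raw@examplestorage.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.examplestorage.blob.core.windows.net":
            dbutils.secrets.get(scope="demo-scope", key="storage-key")
    },
)

# Load a registered MLflow model as a Spark UDF and score a Delta table.
churn_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")
features = spark.read.format("delta").load("/mnt/raw/silver/customer_features")
predictions = features.withColumn("churn_prob", churn_udf(*features.columns))

# Store the predictions in a Delta table for downstream steps.
predictions.write.format("delta").mode("overwrite").saveAsTable("gold.churn_predictions")
```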
Other APIs
- Spark APIs
- Apply common built-in PySpark functions to accomplish ETL tasks
- Delta Lake API
- Create Delta tables
- Read from and write to Delta tables
- Use OPTIMIZE, VACUUM, ZORDER, MERGE, and CLONE on Delta tables (see the sketch after this list)
- Databricks CLI
- Build and deploy basic notebook-centric CI/CD through command-line tools and high-level APIs
- Databricks REST API
- Configure and trigger production pipelines and infrastructure
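A hedged sketch of the Delta Lake operations listed above, issued as SQL from Python; the table names and the `updates` staging view are assumed for illustration:

```python
# Compact small files and co-locate frequently filtered values.
spark.sql("OPTIMIZE silver.events ZORDER BY (device_id)")

# Remove data files no longer referenced by the table (168 hours = the 7-day default).
spark.sql("VACUUM silver.events RETAIN 168 HOURS")

# Upsert from an assumed staging view called `updates`.
spark.sql("""
    MERGE INTO silver.events AS t
    USING updates AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Zero-copy clone for testing against production data.
spark.sql("CREATE OR REPLACE TABLE dev.events_clone SHALLOW CLONE silver.events")
```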
Self Assessment for Databricks Tooling
- Identify potential execution problems by examining event logs
- Orchestrate multiple tasks as a DAG
- Configure dashboards for visual reporting and monitoring
- Identify differences between managed and external tables during dropping and renaming
- Load and apply a model registered in MLflow as part of an ETL pipeline step
- Run bash commands on the driver
Data Processing (30%, 18/60)
Build data processing pipelines using the Spark and Delta Lake APIs, including:
- Building batch-processed ETL pipelines
- Building incrementally processed ETL pipelines
- Optimising workloads
- Deduplicating data
- Using Change Data Capture (CDC) to propagate changes
Building batch-processed ETL pipelines
- Use batch reads and writes to extract, transform and load data from various data sources
- Processing upstream CDC data
- Propagating changes from batch jobs
- Gold table output
- Detecting changes in the Lakehouse
- Join ordering & hints for performance
- Optimising partition size for writes
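A minimal batch-ETL sketch tying several of these points together (broadcast join hint for a small dimension table, partition-aware write); the paths, table names, and target partition count are illustrative assumptions:

```python
from pyspark.sql import functions as F

orders = spark.read.format("json").load("/mnt/raw/orders")      # placeholder source path
customers = spark.read.table("silver.customers")                 # assumed small dimension table

enriched = (
    orders
    .join(F.broadcast(customers), "customer_id")                 # broadcast hint avoids a shuffle join
    .withColumn("order_date", F.to_date("order_ts"))
)

(enriched
 .repartition(64)                                                # tune toward ~128 MB output files
 .write
 .format("delta")
 .mode("append")
 .partitionBy("order_date")
 .saveAsTable("silver.orders_enriched"))
```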
Building incrementally processed ETL pipelines
- Use Structured Streaming APIs to perform incremental data processing with Spark and Delta Lake
- Configure Auto Loader to incrementally ingest data from object storage
- Structured Streaming concepts
- Watermark / Window
- Checkpointing
- Stream trigger intervals
- Stream-static joins
- Auto Compaction / Auto Optimize (optimized writes)
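A hedged sketch combining Auto Loader with a watermarked window aggregation, a checkpoint, and an explicit trigger; the source path, schema (an `event_time` string and a `device_id` column), and checkpoint locations are illustrative assumptions:

```python
from pyspark.sql import functions as F

raw = (spark.readStream
       .format("cloudFiles")                                      # Auto Loader
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/chk/bronze_schema")
       .load("/mnt/raw/events"))

events = raw.withColumn("event_time", F.col("event_time").cast("timestamp"))

windowed = (events
            .withWatermark("event_time", "10 minutes")            # bound state for late data
            .groupBy(F.window("event_time", "5 minutes"), "device_id")
            .count())

(windowed.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/chk/device_counts")          # exactly-once progress tracking
 .trigger(processingTime="1 minute")                              # or once=True for incremental batch runs
 .toTable("silver.device_counts"))
```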
Optimising workloads
Identify the root cause of slow queries and Spark jobs using the Spark UI
- Adequate partition sizes
- Join types
- CPU vs IO bottleneck
- Spill / Skew
- Instance Types
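A sketch of the session settings typically reviewed alongside the Spark UI when chasing skew, spill, or mis-sized partitions; the values shown are illustrative starting points, not recommendations:

```python
# Adaptive Query Execution: coalesces shuffle partitions and splits skewed ones at join time.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Shuffle partition count: aim for task inputs of roughly 100-200 MB.
spark.conf.set("spark.sql.shuffle.partitions", "512")

# Broadcast threshold: controls when a sort-merge join becomes a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))

# Input split size for file scans (affects read parallelism).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
```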
Deduplicating data
- Perform insert-only merges
- Drop duplicate rows
- Use cases and limitations
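A minimal deduplication sketch, assuming `bronze.events` and `silver.events` share a schema and that `event_id`/`event_time` identify a record; the insert-only merge leaves previously loaded keys untouched:

```python
# Remove duplicates within the current batch.
deduped = (spark.read.table("bronze.events")
           .dropDuplicates(["event_id", "event_time"]))
deduped.createOrReplaceTempView("deduped_events")

# Insert-only merge: rows whose key already exists in the target are skipped.
spark.sql("""
    MERGE INTO silver.events AS t
    USING deduped_events AS s
    ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```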
Using Change Data Capture (CDC) to propagate changes
- Propagate changes using Change Data Feed (CDF) in incremental or batch workloads
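A hedged CDF sketch, assuming `silver.customers` was created with `delta.enableChangeDataFeed = true`, that the columns shown exist, and that at most one change per key appears in the version range:

```python
# Read changes committed since an assumed starting version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", "15")
           .table("silver.customers")
           .filter("_change_type != 'update_preimage'")
           .selectExpr("customer_id", "name", "email", "_change_type"))
changes.createOrReplaceTempView("customer_changes")

# Apply inserts, updates, and deletes to the downstream table.
spark.sql("""
    MERGE INTO gold.customers AS t
    USING customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s._change_type = 'delete' THEN DELETE
    WHEN MATCHED AND s._change_type = 'update_postimage' THEN
      UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED AND s._change_type = 'insert' THEN
      INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email)
""")
```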
Data Modeling (20%, 12/60)
Model data management solutions, including:
- Lakehouse concepts
- Bronze-silver-gold architecture
- Databases, tables, views
- Optimising physical layout
- General data modelling concepts
- Keys, constraints, lookup tables, slowly changing dimensions
Bronze-Silver-Gold Architecture
Build tables for a Lakehouse using the bronze-silver-gold architecture
- Design and build standard Bronze-Silver-Gold data serving through Delta and Structured Streaming
- Write PySpark queries to produce tables of different data quality levels
- Ingesting Bronze: Auto Loader, batch append
- Promoting to Silver: Data validation, data flattening, column reordering
- Creating Gold: aggregation with complete output, joins across multiple tables
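A hedged sketch of the silver and gold hops, assuming a `bronze.orders_raw` table with a nested `shipping` struct; the schema, checkpoint paths, and table names are illustrative:

```python
from pyspark.sql import functions as F

# Silver: validate, flatten, and reorder columns from the bronze stream.
silver = (spark.readStream.table("bronze.orders_raw")
          .filter("order_id IS NOT NULL AND amount >= 0")
          .select("order_id", "customer_id", "amount",
                  F.col("shipping.country").alias("country"),
                  F.to_date("order_ts").alias("order_date")))

(silver.writeStream
 .option("checkpointLocation", "/mnt/chk/silver_orders")
 .outputMode("append")
 .toTable("silver.orders"))

# Gold: aggregate with complete output mode.
daily = (spark.readStream.table("silver.orders")
         .groupBy("order_date", "country")
         .agg(F.sum("amount").alias("revenue")))

(daily.writeStream
 .option("checkpointLocation", "/mnt/chk/gold_daily_revenue")
 .outputMode("complete")
 .toTable("gold.daily_revenue"))
```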
Databases, Tables, and Views
- Database Configuration
- Define database locations to set default locations for managed tables
- External Delta Tables
- Create external Delta Lake tables
- Views for Access Control
- Control permissions for end users on gold and silver tables with views
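A short SQL sketch covering all three items; the locations, table names, and columns are placeholders:

```python
# Database location sets the default path for its managed tables.
spark.sql("CREATE DATABASE IF NOT EXISTS silver LOCATION 'dbfs:/mnt/lakehouse/silver.db'")

# External Delta table: data lives at the given path and survives DROP TABLE.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.transactions_ext
    USING DELTA
    LOCATION 'dbfs:/mnt/lakehouse/external/transactions'
""")

# View exposing only non-sensitive columns to end users.
spark.sql("""
    CREATE OR REPLACE VIEW gold.transactions_summary AS
    SELECT transaction_id, amount, transaction_date
    FROM silver.transactions_ext
""")
```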
Optimising Physical Layout
- Partitioning Tables
- Identify appropriate columns for partitioning
- Use generated columns
- Directory Structures
- Bronze vs silver database locations
- Location for managed gold tables and views
- Cloud Storage
- Impacts of multiple tables/databases in single storage account/container
- Cross-region read/write costs and latency
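A small sketch of partitioning on a generated column so timestamp filters can still prune partitions; the table and columns are illustrative:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.events_by_date (
      event_id   BIGINT,
      event_ts   TIMESTAMP,
      payload    STRING,
      event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))   -- derived at write time
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
```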
General Data Modelling Concepts
- Keys
- Implement tables avoiding issues caused by lack of foreign key constraints
- Constraints
- Add constraints to Delta Lake tables to prevent bad data from being written
- Lookup tables
- Implement lookup tables and describe the trade-offs for normalised data models
- Slowly Changing Dimensions
- Implement Type 0, 1, and 2 SCD tables
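A hedged sketch of a CHECK constraint plus a two-step Type 2 SCD update; the dimension columns (`customer_id`, `address`, `effective_start`, `effective_end`, `is_current`) and the `customer_updates` staging view are assumptions:

```python
# Reject rows with impossible validity windows.
spark.sql("""
    ALTER TABLE silver.customers_scd2
    ADD CONSTRAINT valid_window CHECK (effective_end IS NULL OR effective_start <= effective_end)
""")

# Step 1: close out current rows whose tracked attribute changed.
spark.sql("""
    MERGE INTO silver.customers_scd2 AS t
    USING customer_updates AS s
    ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, effective_end = s.updated_at
""")

# Step 2: insert new current versions for changed and brand-new customers.
spark.sql("""
    INSERT INTO silver.customers_scd2
    SELECT s.customer_id, s.address, s.updated_at, NULL, true
    FROM customer_updates s
    LEFT JOIN silver.customers_scd2 t
      ON s.customer_id = t.customer_id AND t.is_current = true
    WHERE t.customer_id IS NULL
""")
```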
Security & Governance (10%, 6/60)
Build production pipelines using best practices around security and governance, including:
- Managing notebook and jobs permissions with ACLs
- Creating row- and column-oriented dynamic views to control user/group access
- Securely storing personally identifiable information (PII)
- Securely deleting data as requested under GDPR and CCPA
ACLs
- Manage permissions on production notebooks and jobs using ACLs
Dynamic Views
- Create row and column oriented dynamic views to control user/group access to sensitive data
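A hedged sketch of a dynamic view using the built-in `is_member()` function; the group names, tables, and row filter are placeholders:

```python
spark.sql("""
    CREATE OR REPLACE VIEW gold.customers_redacted AS
    SELECT
      customer_id,
      -- Column-level control: mask email for users outside the privileged group.
      CASE WHEN is_member('pii_readers') THEN email ELSE 'REDACTED' END AS email,
      region
    FROM silver.customers
    -- Row-level control: non-admins only see their permitted region.
    WHERE is_member('admins') OR region = 'EU'
""")
```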
PII
- Store PII securely; indicate which fields in a dataset are potentially PII
GDPR & CCPA
- Implement a solution to securely delete data as requested within the required time
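A minimal sketch of the deletion mechanics, assuming a `silver.customers` table; deleted rows remain in older data files until VACUUM removes them past the retention window:

```python
# Remove the data subject's rows from the current table version.
spark.sql("DELETE FROM silver.customers WHERE customer_id = 42")

# Physically remove unreferenced files once the retention period has passed
# (168 hours matches the default 7-day retention).
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")
```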
Monitoring & Logging (10%, 6/60)
Configure alerting and storage to monitor and log production jobs, including:
- Setting up notifications
- Configuring SparkListener
- Recording logged metrics
- Navigating and interpreting the Spark UI
- Debugging errors
Monitoring Jobs
- Configure email alerts for job events (start, success, failure)
- Deliver logs and metrics to a defined location
- Audit job attribution using SparkListener for streaming and batch workloads
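A hedged sketch of configuring job email alerts through the Jobs 2.1 REST API; the host, token, job ID, and addresses are placeholders, and the payload shape should be verified against the API docs:

```python
import requests

host = "https://<workspace-host>"          # placeholder
token = "<personal-access-token>"          # in practice, read from a secret scope

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,
        "new_settings": {
            "email_notifications": {
                "on_start": ["oncall@example.com"],
                "on_success": ["oncall@example.com"],
                "on_failure": ["oncall@example.com"],
            }
        },
    },
)
resp.raise_for_status()
```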
Debugging
- Parse Spark logs to identify bugs in production code, transient errors, issues with cloud resources, data quality issues
Spark UI Troubleshooting
- Troubleshoot interactive workloads in development
- Identify bottlenecks in production
Testing & Deployment (10%, 6/60)
- Managing dependencies
- Creating unit tests
- Creating integration tests
- Scheduling jobs
- Versioning code / notebooks
- Orchestrating Jobs
Manage Dependencies
- Simplify dependency management with custom functions, classes, libraries
- Manage environments with custom and OSS libraries at notebook or cluster level
Create Tests
- Build unit tests in Python
- Write testable Python functions and classes
- Build integration testing for full-pipeline work
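A small pytest-style sketch; the transformation under test would normally live in an importable module rather than a notebook, and the function and data are illustrative:

```python
import pytest
from pyspark.sql import SparkSession, functions as F


def add_order_date(df):
    """Transformation under test: derive order_date from order_ts."""
    return df.withColumn("order_date", F.to_date("order_ts"))


@pytest.fixture(scope="session")
def spark():
    # Local Spark session so the test runs outside Databricks.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_add_order_date(spark):
    df = spark.createDataFrame([("2024-01-02 10:00:00",)], ["order_ts"])
    result = add_order_date(df).select("order_date").first()[0]
    assert str(result) == "2024-01-02"
```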
Version Control with Repos
- Sync notebook code with external Git providers
- Sync custom libraries between local and cloud development environments
- Promote code to production environments
Schedule & Orchestrate Jobs
- Schedule jobs to run on a cron schedule using the Jobs UI/CLI/API
- Use REST API calls to trigger jobs programmatically
- Orchestrate multiple notebook-based tasks with linear and branching dependencies
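A hedged sketch of triggering a run and attaching a cron schedule through the Jobs 2.1 REST API; the host, token, job ID, and Quartz expression are placeholders:

```python
import requests

host = "https://<workspace-host>"                    # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}

# Trigger an immediate run of an existing job.
run = requests.post(f"{host}/api/2.1/jobs/run-now",
                    headers=headers, json={"job_id": 123})
run.raise_for_status()
print("started run", run.json()["run_id"])

# Attach a nightly schedule (Quartz cron syntax, 02:00 UTC).
requests.post(
    f"{host}/api/2.1/jobs/update",
    headers=headers,
    json={"job_id": 123,
          "new_settings": {"schedule": {"quartz_cron_expression": "0 0 2 * * ?",
                                        "timezone_id": "UTC"}}},
).raise_for_status()
```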