General

  • Databricks Certified Data Engineer Professional: link
  • Time allotted to complete the exam is 2 hours (120 minutes)
  • Exam fee: $200 USD
  • Number of questions: 60
  • Question type: multiple-choice questions
  • Passing score is at least 70% on the overall exam
  • Code Example
    • data manipulation code will be in SQL when possible
    • Structured Streaming code will be in Python
    • Runtime version is DBR 10.4 LTS
  • Practice Exam: link

Target Audience

  • Data Engineers with >= 2 years of experience
  • Advanced, practitioner-level certification
  • Assesses candidates at a level equivalent to two or more years of data engineering experience with Databricks

Expectations

  • Understanding of the Databricks platform and developer tools
  • Ability to build optimised and cleaned data processing pipelines using the Spark and Delta Lake APIs
  • Ability to model data into a Lakehouse using knowledge of general data modeling concepts
  • Ability to make data pipelines secure, reliable, monitored, and tested before deployment

Out of Scope

The following is not expected of a Professional-level data engineer:

  1. Terraform (infrastructure as code)
  2. Cloud-specific security and integrations
  3. Automation servers (Jenkins, Azure DevOps, etc)
  4. Orchestration tools (Airflow, ADF, etc)
  5. Kafka (and the specifics of other pub/sub systems)
  6. Gitflow
  7. Managed CI/CD
  8. Scala
  9. Delta Live Tables

Exam Topics

  1. Databricks Tooling (20%).
  2. Data Processing (30%).
  3. Data Modeling (20%).
  4. Security & Governance (10%).
  5. Monitoring & Logging (10%).
  6. Testing & Deployment (10%).

Databricks Tooling (20%, 12/60)

Understand how to use, and the benefits of using, the Databricks platform and its tools, including:

  1. Web App: notebooks, clusters, jobs, DBSQL, relational entities, Repos
  2. Databricks APIs: DBUtils, MLflow, magic commands
  3. Apache Spark, Delta Lake, Databricks CLI and REST API

Web App

  • Notebooks
    • Develop and manually execute Spark code.
    • Edit notebooks to eliminate unnecessary compute in production workloads.
  • Clusters
    • Configure Spark configurations, libraries, security settings for interactive development.
    • Examine event logs, driver logs, metrics, Spark UI for potential execution problems.
  • Jobs
    • Schedule notebooks to run on ephemeral clusters.
    • Use cron-based scheduling triggers.
    • Orchestrate multiple tasks as a DAG.
  • DBSQL
    • Configure dashboards for visual reporting and monitoring.
    • Configure email alerts when particular conditions are met.
  • Relational Entities (Unity Catalog)
    • Create databases, tables, and views in user-specified locations.
    • Differences between managed and external tables during table manipulation.
  • Repos
    • Create a new branch and commit changes to external Git provider.
    • Pull changes from external Git provider into a workspace.

Databricks APIs

  • DBUtils
    • Mount external storage locations for reading and writing
    • Interact with files stored in mounted containers
  • MLflow
    • Load and apply a model registered in MLflow as part of an ETL pipeline step
    • Store predictions from a model in MLflow in a Delta table
  • Magic Commands
    • Execute code from other languages in notebooks
    • Run bash commands on the driver
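
A minimal sketch of how these APIs might look together in a notebook cell, assuming hypothetical storage, secret scope, and model names (`examplestorage`, `example-scope`, `orders_model`); magic commands such as %sql and %sh are typed at the top of a cell and are not shown here:

```python
# Mount an external storage location with dbutils (hypothetical container and secret names).
dbutils.fs.mount(
    source="wasbs://raw@examplestorage.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.examplestorage.blob.core.windows.net":
            dbutils.secrets.get(scope="example-scope", key="storage-key")
    },
)

# Interact with files stored in the mounted container.
display(dbutils.fs.ls("/mnt/raw/orders/"))

# Load a model registered in MLflow, apply it as an ETL step,
# and store the predictions in a Delta table.
import mlflow

predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/orders_model/Production")
df = spark.read.table("orders_silver")
(df.withColumn("prediction", predict(*df.columns))
   .write.format("delta").mode("append").saveAsTable("orders_predictions"))
```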

Other APIs

  • Spark APIs
    • Apply common built-in PySpark functions to accomplish ETL tasks
  • Delta Lake API
    • Create Delta tables
    • Read from and write to Delta tables
    • Run OPTIMIZE, VACUUM, ZORDER, MERGE, and CLONE operations on Delta tables
  • Databricks CLI
    • Build and deploy basic notebook-centric CI/CD through command-line tools and high-level APIs
  • Databricks REST API
    • Configure and trigger production pipelines and infrastructure
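
A rough illustration of the Delta Lake operations listed above, issued through `spark.sql` and the `DeltaTable` API; the table names are placeholders:

```python
from delta.tables import DeltaTable

# Create a Delta table and write to it.
spark.sql("CREATE TABLE IF NOT EXISTS sales (id BIGINT, region STRING, amount DOUBLE) USING DELTA")
spark.createDataFrame([(1, "emea", 10.0)], "id BIGINT, region STRING, amount DOUBLE") \
    .write.format("delta").mode("append").saveAsTable("sales")

# Read the table back as a DataFrame.
df = spark.read.table("sales")

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales ZORDER BY (region)")

# Remove data files no longer referenced by the table (default retention: 7 days).
spark.sql("VACUUM sales")

# Clone the table: DEEP copies the data files, SHALLOW copies only metadata.
spark.sql("CREATE TABLE IF NOT EXISTS sales_dev SHALLOW CLONE sales")

# Upsert with MERGE through the DeltaTable API.
updates = spark.createDataFrame([(1, "emea", 12.0)], "id BIGINT, region STRING, amount DOUBLE")
(DeltaTable.forName(spark, "sales").alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```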

Self Assessment for Databricks Tooling

  • Identify potential execution problems by examining event logs
  • Orchestrate multiple tasks as a DAG
  • Configure dashboards for visual reporting and monitoring
  • Identify differences between managed and external tables during dropping and renaming
  • Load and apply a model registered in MLflow as part of an ETL pipeline step
  • Run bash commands on the driver

Data Processing (30%, 18/60)

Build data processing pipelines using the Spark and Delta Lake APIs, including:

  1. Building batch-processed ETL pipelines
  2. Building incrementally processed ETL pipelines
  3. Optimising workloads
  4. Deduplicating data
  5. Using Change Data Capture (CDC) to propagate changes

Building batch-processed ETL pipelines

  1. Use batch reads and writes to extract, transform and load data from various data sources
  2. Processing upstream CDC data
  3. Propagating changes from batch jobs
  4. Gold table output
  5. Detecting changes in the Lakehouse
  6. Join orders & hints for performance
  7. Optimising partition size for writes
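
A hedged sketch of applying upstream CDC records into a silver table with a batch MERGE; the table names, the `operation` column values, and the keep-latest-change logic are assumptions about the upstream feed, not part of the exam guide:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Read a batch of upstream CDC records (assumed columns: customer_id, address, operation, change_time).
cdc = spark.read.format("delta").load("/mnt/raw/customers_cdc")

# Keep only the latest change per key so MERGE receives exactly one row per customer_id.
latest = (cdc
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("customer_id").orderBy(F.col("change_time").desc())))
    .filter("rn = 1")
    .drop("rn"))

# Apply inserts, updates, and deletes into the silver table in a single MERGE.
(DeltaTable.forName(spark, "customers_silver").alias("t")
    .merge(latest.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.operation = 'DELETE'")
    .whenMatchedUpdateAll(condition="s.operation != 'DELETE'")
    .whenNotMatchedInsertAll(condition="s.operation != 'DELETE'")
    .execute())
```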

Building incrementally processed ETL pipelines

  1. Use Structured Streaming APIs to perform incremental data processing with Spark and Delta Lake
  2. Configure Auto Loader to incrementally ingest data from object storage
  3. Structured Streaming concepts
  4. Watermark / Window
  5. Checkpointing
  6. Stream trigger intervals
  7. Stream-static joins
  8. Auto Compaction / Auto Optimize
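
A minimal Structured Streaming sketch that ties several of these items together (Auto Loader ingest, a watermarked window aggregate, a checkpoint, and an explicit trigger interval); the paths, schema, and column names are placeholders:

```python
from pyspark.sql import functions as F

# Incrementally ingest JSON files from object storage with Auto Loader.
bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders/")
    .withColumn("event_time", F.col("event_time").cast("timestamp")))

# Watermark + window: tolerate 10 minutes of late data, count orders
# per 5-minute event-time window and region.
counts = (bronze
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "region")
    .count())

# Incremental write with a checkpoint and an explicit processing-time trigger.
query = (counts.writeStream
    .format("delta")
    .outputMode("append")  # append emits only finalized windows, thanks to the watermark
    .option("checkpointLocation", "/mnt/checkpoints/orders_counts")
    .trigger(processingTime="1 minute")
    .toTable("orders_counts"))
```

Auto Optimize / Auto Compaction are enabled through table or session properties such as `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact`, not through streaming options.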

Optimising workloads

Identify the root cause of slow queries and Spark jobs using the Spark UI

  1. Adequate partition sizes
  2. Join types
  3. CPU vs IO bottleneck
  4. Spill / Skew
  5. Instance Types

Deduplicating data

  1. Perform insert-only merges
  2. Drop duplicate rows
  3. Use cases and limitations
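
A short sketch of both approaches, assuming a placeholder `event_id` key; note that an insert-only merge only prevents duplicates on the merge key, and streaming `dropDuplicates` needs a watermark to bound its state:

```python
from delta.tables import DeltaTable

new_events = spark.read.format("delta").load("/mnt/raw/events_batch")

# 1. Drop duplicate rows within the incoming batch itself.
deduped = new_events.dropDuplicates(["event_id"])

# 2. Insert-only merge: rows whose event_id already exists in the target are skipped,
#    so reprocessing the same batch does not create duplicates.
(DeltaTable.forName(spark, "events_silver").alias("t")
    .merge(deduped.alias("s"), "t.event_id = s.event_id")
    .whenNotMatchedInsertAll()
    .execute())
```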

Using Change Data Capture (CDC) to propagate changes

  • Propagate changes using Change Data Feed (CDF) in incremental (streaming) or batch mode
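
A sketch of reading the Change Data Feed in batch and streaming modes, assuming a placeholder table with CDF enabled:

```python
# Enable Change Data Feed on an existing table (can also be set at creation time).
spark.sql("ALTER TABLE customers_silver SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Batch read of changes between two table versions.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("customers_silver"))

# Incremental (streaming) read of the change feed.
stream = (spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("customers_silver"))

# Each change row carries _change_type ('insert', 'update_preimage',
# 'update_postimage', 'delete'), _commit_version, and _commit_timestamp.
```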

Data Modeling (20%, 12/60)

Model data management solutions, including:

  • Lakehouse concepts
    • Bronze-silver-gold architecture
    • Databases, tables, views
    • Optimising physical layout
  • General data modelling concepts
    • Keys, constraints, lookup tables, slowly changing dimensions

Bronze-Silver-Gold Architecture

Build tables for a Lakehouse using the bronze-silver-gold architecture

  • Design and build standard Bronze-Silver-Gold data serving through Delta and Structured Streaming
  • Write PySpark queries to produce tables of different data quality levels
  • Ingesting Bronze: Auto Loader, batch append
  • Promoting to Silver: Data validation, data flattening, column reordering
  • Creating Gold: aggregations with complete output mode, joins across multiple tables
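
A hedged sketch of the gold step described above: a streaming aggregation over the silver table written with complete output mode (table, column, and path names are placeholders):

```python
from pyspark.sql import functions as F

# Stream from the silver table and maintain a full daily-revenue summary for BI.
silver = spark.readStream.table("orders_silver")

daily_revenue = (silver
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue")))

# Complete output rewrites the whole aggregate on each trigger; trigger(once=True)
# runs the stream as an incremental batch job.
(daily_revenue.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/mnt/checkpoints/daily_revenue_gold")
    .trigger(once=True)
    .toTable("daily_revenue_gold"))
```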

Databases, Tables, and Views

  • Database Configuration
    • Define database locations to set default locations for managed tables
  • External Delta Tables
    • Create external Delta Lake tables
  • Views for Access Control
    • Control permissions for end users on gold and silver tables with views
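
A brief sketch (via `spark.sql`, to stay in one language across these examples) of a database with an explicit location, an external Delta table, and an access-control view over an assumed existing `sales_db.orders_silver` table; all names and paths are hypothetical:

```python
# Database location sets the default storage path for its managed tables.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db LOCATION '/mnt/lakehouse/sales_db.db'")

# External table: data lives at the given LOCATION and survives DROP TABLE.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders_bronze
    (order_id BIGINT, customer_id BIGINT, amount DOUBLE, ingested_at TIMESTAMP)
    USING DELTA
    LOCATION '/mnt/raw/orders_bronze'
""")

# View for access control: end users are granted SELECT on the view
# (not on the underlying silver table), limiting the columns they can see.
spark.sql("""
    CREATE VIEW IF NOT EXISTS sales_db.orders_public AS
    SELECT order_id, amount, ingested_at
    FROM sales_db.orders_silver
""")
```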

Optimising Physical Layout

  • Partitioning Tables
    • Identify appropriate columns for partitioning
    • Use generated columns
  • Directory Structures
    • Bronze vs silver database locations
    • Location for managed gold tables and views
  • Cloud Storage
    • Impacts of multiple tables/databases in a single storage account/container
    • Cross-region read/write costs and latency
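
A sketch of partitioning on a generated date column so that filters on the raw timestamp still prune partitions; the table name and location are placeholders:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_bronze (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING,
        event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
    )
    USING DELTA
    PARTITIONED BY (event_date)
    LOCATION '/mnt/lakehouse/events_bronze'
""")

# Filtering on event_time lets Delta derive the matching event_date partition filter.
spark.sql("SELECT count(*) FROM events_bronze WHERE event_time >= '2022-01-01'").show()
```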

General Data Modelling Concepts

  • Keys
    • Implement tables avoiding issues caused by lack of foreign key constraints
  • Constraints
    • Add constraints to Delta Lake tables to prevent bad data from being written
  • Lookup tables
    • Implement lookup tables and describe the trade-offs for normalised data models
  • Slowly Changing Dimensions
    • Implement Type 0, 1, and 2 SCD tables
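
A small sketch of the constraint items, assuming an existing placeholder table `orders_silver`:

```python
# NOT NULL: reject rows with a missing key.
spark.sql("ALTER TABLE orders_silver ALTER COLUMN order_id SET NOT NULL")

# CHECK constraint: reject rows that violate the condition; a violating
# write fails the whole transaction.
spark.sql("ALTER TABLE orders_silver ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
```

Type 1 SCD tables are typically maintained with a MERGE that overwrites matched rows, while Type 2 adds effective-date / current-flag columns and a MERGE that expires the old row and inserts the new version.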

Security & Governance (10%, 6/60)

Build production pipelines using best practices around security and governance, including:

  1. Managing notebook and jobs permissions with ACLs
  2. Creating row- and column-oriented dynamic views to control user/group access
  3. Securely storing personally identifiable information (PII)
  4. Securely deleting data as requested under GDPR and CCPA

ACLs

  • Manage permissions on production notebooks and jobs using ACLs

Dynamic Views

  • Create row- and column-oriented dynamic views to control user/group access to sensitive data
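
A sketch of a dynamic view using `is_member()` for both column masking and row filtering; the group name, columns, and filter condition are hypothetical:

```python
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT
      customer_id,
      -- column masking: only the pii_admins group sees raw emails
      CASE WHEN is_member('pii_admins') THEN email ELSE 'REDACTED' END AS email,
      region
    FROM customers_silver
    -- row filtering: everyone else sees only the 'emea' region
    WHERE is_member('pii_admins') OR region = 'emea'
""")
```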

PII

  • Store PII securely; indicate which fields in a dataset are potentially PII

GDPR & CCPA

  • Implement a solution to securely delete data as requested within the required time
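
A hedged sketch of handling a delete request: removing the rows is not enough on its own, because older table versions still contain the data until VACUUM removes the underlying files (table name and retention period are placeholders):

```python
# Remove the requested user's records from the current table version.
spark.sql("DELETE FROM customers_silver WHERE customer_id = 42")

# Earlier versions still hold the data until their files are vacuumed, so VACUUM
# must run within the regulatory deadline. Retaining less than the 7-day default
# requires disabling the safety check (use with care: it breaks time travel and
# can affect long-running readers).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM customers_silver RETAIN 24 HOURS")
```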

Monitoring & Logging (10%, 6/60)

Configure alerting and storage to monitor and log production jobs, including:

  1. Setting up notifications
  2. Configuring SparkListener
  3. Recording logged metrics
  4. Navigating and interpreting the Spark UI
  5. Debugging errors

Monitoring Jobs

  1. Configure email alerts for job events (start, success, failure)
  2. Deliver logging metrics to a defined location
  3. Audit job attribution using SparkListener for streaming and batch workloads
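
Custom SparkListeners are JVM classes, so as a hedged Python-only alternative sketch, streaming metrics can be recorded from a query's progress object into a Delta table; the query handle and table name are placeholders:

```python
import json

# 'query' is an active StreamingQuery, e.g. the handle returned by writeStream.start().
progress = query.lastProgress  # dict with id, batchId, numInputRows, durationMs, ...

if progress is not None:
    (spark.createDataFrame(
        [(progress["id"], progress["batchId"], progress["numInputRows"], json.dumps(progress))],
        "query_id STRING, batch_id LONG, num_input_rows LONG, raw STRING")
        .write.format("delta").mode("append").saveAsTable("streaming_metrics"))
```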

Debugging

  • Parse Spark logs to identify bugs in production code, transient errors, issues with cloud resources, and data quality issues

Spark UI Troubleshooting

  • Troubleshoot interactive workloads in development
  • Identify bottlenecks in production

Testing & Deployment (10%, 6/60)

  1. Managing dependencies
  2. Creating unit tests
  3. Creating integration tests
  4. Scheduling jobs
  5. Versioning code / notebooks
  6. Orchestrating Jobs

Manage Dependencies

  1. Simplify dependency management with custom functions, classes, libraries
  2. Manage environments with custom and OSS libraries at notebook or cluster level

Create Tests

  1. Build unit tests in Python
  2. Write testable Python functions and classes
  3. Build integration testing for full-pipeline work
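
A minimal pytest sketch, assuming pipeline logic is factored into small, testable functions (the function and file names are hypothetical); integration tests follow the same pattern but exercise the full pipeline against sample inputs:

```python
# test_transforms.py — unit test for a small, testable PySpark function.
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def add_revenue(df: DataFrame) -> DataFrame:
    """Pure transformation under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so tests run outside Databricks.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_revenue(spark):
    df = spark.createDataFrame([(2, 3.0)], "quantity INT, unit_price DOUBLE")
    result = add_revenue(df).collect()[0]
    assert result["revenue"] == 6.0
```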

Version Control with Repos

  1. Sync notebook code with external Git providers
  2. Sync custom libraries between local and cloud development environments
  3. Promote code to production environments

Schedule & Orchestrate Jobs

  1. Schedule jobs to run on a cron schedule using the Jobs UI/CLI/API
  2. Use REST API calls to trigger jobs programmatically
  3. Orchestrate multiple notebook-based tasks with linear and branching dependencies
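
A hedged sketch of triggering an existing job through the Jobs 2.1 REST API; the workspace URL, token handling, and job_id are placeholders:

```python
import requests

# Placeholders: workspace URL, a personal access token (kept out of source control),
# and the id of an existing job.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123, "notebook_params": {"run_date": "2022-06-01"}},
)
resp.raise_for_status()
print(resp.json()["run_id"])  # id of the triggered run
```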