General
- Databricks Certified Data Engineer Professional: link
- Time allotted to complete the exam is 2 hours (120 minutes)
- Exam fee $200 USD
- Number of questions 60
- Question type: multiple choice questions
- Passing score is at least 70% on the overall exam
- Code Examples
- Data manipulation code will be in SQL when possible
- Structured Streaming code will be in Python
- Runtime version is DBR 10.4 LTS
- Practice Exam: link
Target Audience
- Data Engineer, >= 2yoe
- Advanced practitioner certification
- Assesses candidates at a level equivalent to two or more years of data engineering experience with Databricks
Expectation
- Understanding of the Databricks platform and developer tools
- Ability to build optimised and cleaned data processing pipelines using the Spark and Delta Lake APIs
- Ability to model data into a Lakehouse using knowledge of general data modeling concepts
- Ability to make data pipelines secure, reliable, monitored, and tested before deployment
Out of Scope
The following is not expected of a Professional-level data engineer:
- Terraform (infrastructure as code)
- Cloud-specific security and integrations
- Automation servers (Jenkins, Azure DevOps, etc)
- Orchestration tools (Airflow, ADF, etc)
- Kafka (and specifications of other pub/sub systems)
- Gitflow
- Managed CI/CD
- Scala
- Delta Live Tables
Exam Topics
- Databricks Tooling (20%)
- Data Processing (30%)
- Data Modeling (20%)
- Security & Governance (10%)
- Monitoring & Logging (10%)
- Testing & Deployment (10%)
Databricks Tooling (20%, 12/60)
Understand how to use the Databricks platform and its tools, and the benefits of using them, including:
- Web App: notebooks, clusters, jobs, DBSQL, relational entities, Repos
- Databricks APIs: DBUtils, MLflow, magic commands
- Apache Spark, Delta Lake, Databricks CLI and REST API
Web App
- Notebooks
- Develop and manually execute Spark code.
- Edit notebooks to eliminate unnecessary compute in production workloads.
- Clusters
- Configure Spark settings, libraries, and security options for interactive development.
- Examine event logs, driver logs, metrics, Spark UI for potential execution problems.
- Jobs
- Schedule notebooks to run on ephemeral clusters.
- Use cron-based (scheduled) triggers.
- Orchestrate multiple tasks as a DAG.
- DBSQL
- Configure dashboards for visual reporting and monitoring.
- Configure email alerts when particular conditions are met.
- Relational Entities (Unity Catalog)
- Create databases, tables, and views in user-specified locations.
- Understand the differences between managed and external tables during table manipulation (e.g., dropping and renaming).
- Repos
- Create a new branch and commit changes to external Git provider.
- Pull changes from external Git provider into a workspace.
Databricks APIs
- DBUtils
- Mount external storage locations for reading and writing
- Interact with files stored in mounted containers
- MLflow
- Load and apply a model registered in MLflow as part of an ETL pipeline step
- Store predictions from an MLflow model in a Delta table (see the sketch after this list)
- Magic Commands
- Execute code from other languages in notebooks
- Run bash commands on the driver
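A minimal sketch of how these pieces fit into an ETL step, assuming a notebook context (where `spark` and `dbutils` are provided); the storage account, secret scope, paths, and the registered model name `churn_model` are all illustrative placeholders:

```python
import mlflow

# Mount an external container (placeholder storage account and secret scope).
dbutils.fs.mount(
    source="wasbs://raw@examplestorage.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.examplestorage.blob.core.windows.net":
            dbutils.secrets.get(scope="demo-scope", key="storage-key")
    },
)

# Load a registered MLflow model as a Spark UDF and score a Delta table.
churn_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")
features = spark.read.format("delta").load("/mnt/raw/silver/customer_features")
predictions = features.withColumn("churn_prob", churn_udf(*features.columns))

# Store the predictions in a Delta table for downstream steps.
predictions.write.format("delta").mode("overwrite").saveAsTable("gold.churn_predictions")
```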
Other APIs
- Spark APIs
- Apply common built-in PySpark functions to accomplish ETL tasks
- Delta Lake API
- Create Delta tables
- Read from and write to Delta tables
- Use OPTIMIZE, VACUUM, ZORDER, MERGE, and CLONE on Delta tables (see the sketch after this list)
- Databricks CLI
- Build and deploy basic notebook-centric CI/CD through command-line tools and high-level APIs
- Databricks REST API
- Configure and trigger production pipelines and infrastructure
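A hedged sketch of the Delta Lake operations listed above, issued as SQL from Python; the table names and the `updates` staging view are assumed for illustration:

```python
# Compact small files and co-locate frequently filtered values.
spark.sql("OPTIMIZE silver.events ZORDER BY (device_id)")

# Remove data files no longer referenced by the table (168 hours = the 7-day default).
spark.sql("VACUUM silver.events RETAIN 168 HOURS")

# Upsert from an assumed staging view called `updates`.
spark.sql("""
    MERGE INTO silver.events AS t
    USING updates AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Zero-copy clone for testing against production data.
spark.sql("CREATE OR REPLACE TABLE dev.events_clone SHALLOW CLONE silver.events")
```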
Self Assessment for Databricks Tooling
- Identify potential execution problems by examining event logs
- Orchestrate multiple tasks as a DAG
- Configure dashboards for visual reporting and monitoring
- Identify differences between managed and external tables during dropping and renaming
- Load and apply a model registered in MLflow as part of an ETL pipeline step
- Run bash commands on the driver
Data Processing (30%, 18/60)
Build data processing pipelines using the Spark and Delta Lake APIs, including:
- Building batch-processed ETL pipelines
- Building incrementally processed ETL pipelines
- Optimising workloads
- Deduplicating data
- Using Change Data Capture (CDC) to propagate changes
Building batch-processed ETL pipelines
- Use batch reads and writes to extract, transform and load data from various data sources
- Processing upstream CDC data
- Propagating changes from batch jobs
- Gold table output
- Detecting changes in the Lakehouse
- Join ordering & hints for performance
- Optimising partition size for writes
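A minimal batch-ETL sketch tying several of these points together (broadcast join hint for a small dimension table, partition-aware write); the paths, table names, and target partition count are illustrative assumptions:

```python
from pyspark.sql import functions as F

orders = spark.read.format("json").load("/mnt/raw/orders")      # placeholder source path
customers = spark.read.table("silver.customers")                 # assumed small dimension table

enriched = (
    orders
    .join(F.broadcast(customers), "customer_id")                 # broadcast hint avoids a shuffle join
    .withColumn("order_date", F.to_date("order_ts"))
)

(enriched
 .repartition(64)                                                # tune toward ~128 MB output files
 .write
 .format("delta")
 .mode("append")
 .partitionBy("order_date")
 .saveAsTable("silver.orders_enriched"))
```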
Building incrementally processed ETL pipelines
- Use Structured Streaming APIs to perform incremental data processing with Spark and Delta Lake
- Configure Auto Loader to incrementally ingest data from object storage
- Structured Streaming concepts
- Watermark / Window
- Checkpointing
- Stream trigger intervals
- Stream-static joins
- Auto Compaction / Auto Optimize (optimized writes)
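A hedged sketch combining Auto Loader with a watermarked window aggregation, a checkpoint, and an explicit trigger; the source path, schema (an `event_time` string and a `device_id` column), and checkpoint locations are illustrative assumptions:

```python
from pyspark.sql import functions as F

raw = (spark.readStream
       .format("cloudFiles")                                      # Auto Loader
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/chk/bronze_schema")
       .load("/mnt/raw/events"))

events = raw.withColumn("event_time", F.col("event_time").cast("timestamp"))

windowed = (events
            .withWatermark("event_time", "10 minutes")            # bound state for late data
            .groupBy(F.window("event_time", "5 minutes"), "device_id")
            .count())

(windowed.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/chk/device_counts")          # exactly-once progress tracking
 .trigger(processingTime="1 minute")                              # or once=True for incremental batch runs
 .toTable("silver.device_counts"))
```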
Optimising workloads
Identify the root cause of slow queries and Spark jobs using the Spark UI
- Adequate partition sizes
- Join types
- CPU vs IO bottleneck
- Spill / Skew
- Instance Types
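A sketch of the session settings typically reviewed alongside the Spark UI when chasing skew, spill, or mis-sized partitions; the values shown are illustrative starting points, not recommendations:

```python
# Adaptive Query Execution: coalesces shuffle partitions and splits skewed ones at join time.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Shuffle partition count: aim for task inputs of roughly 100-200 MB.
spark.conf.set("spark.sql.shuffle.partitions", "512")

# Broadcast threshold: controls when a sort-merge join becomes a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))

# Input split size for file scans (affects read parallelism).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
```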
Deduplicating data
- Perform insert-only merges
- Drop duplicate rows
- Use cases and limitations
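A minimal deduplication sketch, assuming `bronze.events` and `silver.events` share a schema and that `event_id`/`event_time` identify a record; the insert-only merge leaves previously loaded keys untouched:

```python
# Remove duplicates within the current batch.
deduped = (spark.read.table("bronze.events")
           .dropDuplicates(["event_id", "event_time"]))
deduped.createOrReplaceTempView("deduped_events")

# Insert-only merge: rows whose key already exists in the target are skipped.
spark.sql("""
    MERGE INTO silver.events AS t
    USING deduped_events AS s
    ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```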
Using Change Data Capture (CDC) to propagate changes
- Propagate changes using Change Data Feed (CDF) in incremental or batch workloads
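A hedged CDF sketch, assuming `silver.customers` was created with `delta.enableChangeDataFeed = true`, that the columns shown exist, and that at most one change per key appears in the version range:

```python
# Read changes committed since an assumed starting version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", "15")
           .table("silver.customers")
           .filter("_change_type != 'update_preimage'")
           .selectExpr("customer_id", "name", "email", "_change_type"))
changes.createOrReplaceTempView("customer_changes")

# Apply inserts, updates, and deletes to the downstream table.
spark.sql("""
    MERGE INTO gold.customers AS t
    USING customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s._change_type = 'delete' THEN DELETE
    WHEN MATCHED AND s._change_type = 'update_postimage' THEN
      UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED AND s._change_type = 'insert' THEN
      INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email)
""")
```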
Data Modeling (20%, 12/60)
Model data management solutions, including:
- Lakehouse concepts
- Bronze-silver-gold architecture
- Databases, tables, views
- Optimising physical layout
- General data modelling concepts
- Keys, constraints, lookup tables, slowly changing dimensions
Bronze-Silver-Gold Architecture
Build tables for a Lakehouse using the bronze-silver-gold architecture
- Design and build standard Bronze-Silver-Gold data serving through Delta and Structured Streaming
- Write PySpark queries to produce tables of different data quality levels
- Ingesting Bronze: Auto Loader, batch append
- Promoting to Silver: Data validation, data flattening, column reordering
- Creating Gold: aggregation with complete output, joins across multiple tables
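A hedged sketch of the silver and gold hops, assuming a `bronze.orders_raw` table with a nested `shipping` struct; the schema, checkpoint paths, and table names are illustrative:

```python
from pyspark.sql import functions as F

# Silver: validate, flatten, and reorder columns from the bronze stream.
silver = (spark.readStream.table("bronze.orders_raw")
          .filter("order_id IS NOT NULL AND amount >= 0")
          .select("order_id", "customer_id", "amount",
                  F.col("shipping.country").alias("country"),
                  F.to_date("order_ts").alias("order_date")))

(silver.writeStream
 .option("checkpointLocation", "/mnt/chk/silver_orders")
 .outputMode("append")
 .toTable("silver.orders"))

# Gold: aggregate with complete output mode.
daily = (spark.readStream.table("silver.orders")
         .groupBy("order_date", "country")
         .agg(F.sum("amount").alias("revenue")))

(daily.writeStream
 .option("checkpointLocation", "/mnt/chk/gold_daily_revenue")
 .outputMode("complete")
 .toTable("gold.daily_revenue"))
```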
Databases, Tables, and Views
- Database Configuration
- Define database locations to set default locations for managed tables
- External Delta Tables
- Create external Delta Lake tables
- Views for Access Control
- Control permissions for end users on gold and silver tables with views
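A short SQL sketch covering all three items; the locations, table names, and columns are placeholders:

```python
# Database location sets the default path for its managed tables.
spark.sql("CREATE DATABASE IF NOT EXISTS silver LOCATION 'dbfs:/mnt/lakehouse/silver.db'")

# External Delta table: data lives at the given path and survives DROP TABLE.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.transactions_ext
    USING DELTA
    LOCATION 'dbfs:/mnt/lakehouse/external/transactions'
""")

# View exposing only non-sensitive columns to end users.
spark.sql("""
    CREATE OR REPLACE VIEW gold.transactions_summary AS
    SELECT transaction_id, amount, transaction_date
    FROM silver.transactions_ext
""")
```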
Optimising Physical Layout
- Partitioning Tables
- Identify appropriate columns for partitioning
- Use generated columns
- Directory Structures
- Bronze vs silver database locations
- Location for managed gold tables and views
- Cloud Storage
- Impacts of multiple tables/databases in single storage account/container
- Cross-region read/write costs and latency
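A small sketch of partitioning on a generated column so timestamp filters can still prune partitions; the table and columns are illustrative:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.events_by_date (
      event_id   BIGINT,
      event_ts   TIMESTAMP,
      payload    STRING,
      event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))   -- derived at write time
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
```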
General Data Modelling Concepts
- Keys
- Implement tables avoiding issues caused by lack of foreign key constraints
- Constraints
- Add constraints to Delta Lake tables to prevent bad data from being written
- Lookup tables
- Implement lookup tables and describe the trade-offs for normalised data models
- Slowly Changing Dimensions
- Implement Type 0, 1, and 2 SCD tables
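A hedged sketch of a CHECK constraint plus a two-step Type 2 SCD update; the dimension columns (`customer_id`, `address`, `effective_start`, `effective_end`, `is_current`) and the `customer_updates` staging view are assumptions:

```python
# Reject rows with impossible validity windows.
spark.sql("""
    ALTER TABLE silver.customers_scd2
    ADD CONSTRAINT valid_window CHECK (effective_end IS NULL OR effective_start <= effective_end)
""")

# Step 1: close out current rows whose tracked attribute changed.
spark.sql("""
    MERGE INTO silver.customers_scd2 AS t
    USING customer_updates AS s
    ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, effective_end = s.updated_at
""")

# Step 2: insert new current versions for changed and brand-new customers.
spark.sql("""
    INSERT INTO silver.customers_scd2
    SELECT s.customer_id, s.address, s.updated_at, NULL, true
    FROM customer_updates s
    LEFT JOIN silver.customers_scd2 t
      ON s.customer_id = t.customer_id AND t.is_current = true
    WHERE t.customer_id IS NULL
""")
```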
Security & Governance (10%, 6/60)
Build production pipelines using best practices around security and governance, including:
- Managing notebook and jobs permissions with ACLs
- Creating row- and column-oriented dynamic views to control user/group access
- Securely storing personally identifiable information (PII)
- Securely deleting data as requested under GDPR and CCPA
ACLs
- Manage permissions on production notebooks and jobs using ACLs
Dynamic Views
- Create row and column oriented dynamic views to control user/group access to sensitive data
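A hedged sketch of a dynamic view using the built-in `is_member()` function; the group names, tables, and row filter are placeholders:

```python
spark.sql("""
    CREATE OR REPLACE VIEW gold.customers_redacted AS
    SELECT
      customer_id,
      -- Column-level control: mask email for users outside the privileged group.
      CASE WHEN is_member('pii_readers') THEN email ELSE 'REDACTED' END AS email,
      region
    FROM silver.customers
    -- Row-level control: non-admins only see their permitted region.
    WHERE is_member('admins') OR region = 'EU'
""")
```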
PII
- Store PII securely; indicate which fields in a dataset are potentially PII
GDPR & CCPA
- Implement a solution to securely delete data as requested within the required time
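A minimal sketch of the deletion mechanics, assuming a `silver.customers` table; deleted rows remain in older data files until VACUUM removes them past the retention window:

```python
# Remove the data subject's rows from the current table version.
spark.sql("DELETE FROM silver.customers WHERE customer_id = 42")

# Physically remove unreferenced files once the retention period has passed
# (168 hours matches the default 7-day retention).
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")
```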
Monitoring & Logging (10%, 6/60)
Configure alerting and storage to monitor and log production jobs, including:
- Setting up notifications
- Configuring SparkListener
- Recording logged metrics
- Navigating and interpreting the Spark UI
- Debugging errors
Monitoring Jobs
- Configure email alerts for job events (start, success, failure)
- Deliver logs and metrics to a defined location
- Audit job attribution using SparkListener for streaming and batch workloads
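A hedged sketch of configuring job email alerts through the Jobs 2.1 REST API; the host, token, job ID, and addresses are placeholders, and the payload shape should be verified against the API docs:

```python
import requests

host = "https://<workspace-host>"          # placeholder
token = "<personal-access-token>"          # in practice, read from a secret scope

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,
        "new_settings": {
            "email_notifications": {
                "on_start": ["oncall@example.com"],
                "on_success": ["oncall@example.com"],
                "on_failure": ["oncall@example.com"],
            }
        },
    },
)
resp.raise_for_status()
```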
Debugging
- Parse Spark logs to identify bugs in production code, transient errors, issues with cloud resources, data quality issues
Spark UI Troubleshooting
- Troubleshoot interactive workloads in development
- Identify bottlenecks in production
Testing & Deployment (10%, 6/60)
- Managing dependencies
- Creating unit tests
- Creating integration tests
- Scheduling jobs
- Versioning code / notebooks
- Orchestrating Jobs
Manage Dependencies
- Simplify dependency management with custom functions, classes, libraries
- Manage environments with custom and OSS libraries at notebook or cluster level
Create Tests
- Build unit tests in Python
- Write testable Python functions and classes
- Build integration testing for full-pipeline work
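A small pytest-style sketch; the transformation under test would normally live in an importable module rather than a notebook, and the function and data are illustrative:

```python
import pytest
from pyspark.sql import SparkSession, functions as F


def add_order_date(df):
    """Transformation under test: derive order_date from order_ts."""
    return df.withColumn("order_date", F.to_date("order_ts"))


@pytest.fixture(scope="session")
def spark():
    # Local Spark session so the test runs outside Databricks.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_add_order_date(spark):
    df = spark.createDataFrame([("2024-01-02 10:00:00",)], ["order_ts"])
    result = add_order_date(df).select("order_date").first()[0]
    assert str(result) == "2024-01-02"
```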
Version Control with Repos
- Sync notebook code with external Git providers
- Sync custom libraries between local and cloud development environments
- Promote code to production environments
Schedule & Orchestrate Jobs
- Schedule jobs to run on a cron schedule using the Jobs UI/CLI/API
- Use REST API calls to trigger jobs programmatically
- Orchestrate multiple notebook-based tasks with linear and branching dependencies
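A hedged sketch of triggering a run and attaching a cron schedule through the Jobs 2.1 REST API; the host, token, job ID, and Quartz expression are placeholders:

```python
import requests

host = "https://<workspace-host>"                    # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}

# Trigger an immediate run of an existing job.
run = requests.post(f"{host}/api/2.1/jobs/run-now",
                    headers=headers, json={"job_id": 123})
run.raise_for_status()
print("started run", run.json()["run_id"])

# Attach a nightly schedule (Quartz cron syntax, 02:00 UTC).
requests.post(
    f"{host}/api/2.1/jobs/update",
    headers=headers,
    json={"job_id": 123,
          "new_settings": {"schedule": {"quartz_cron_expression": "0 0 2 * * ?",
                                        "timezone_id": "UTC"}}},
).raise_for_status()
```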