General

  • Databricks Certified Data Engineer Associate: link
  • Time allotted to complete the exam is 1.5 hours (90 minutes)
  • Exam fee is $200 USD
  • Number of questions: 45
  • Passing score is at least 70% on the overall exam
  • Code Examples
    • Data manipulation code will be in SQL when possible
    • Structured Streaming code will be in Python
    • Runtime version is DBR 10.4 LTS
  • Practice Exam: link

Expectations

  1. Databricks Lakehouse Platform (24%). Understand how to use the Databricks Lakehouse Platform and its tools, and the benefits they bring to data teams.
  2. ELT with Spark SQL and Python (29%). Build ETL pipelines using Apache Spark SQL and Python.
  3. Incremental Data Processing (22%). Incrementally process data with Structured Streaming and Delta Live Tables.
  4. Production Pipelines (16%). Build production pipelines for data engineering applications, and Databricks SQL queries and dashboards.
  5. Data Governance (9%). Understand and follow best security practices.

Out-of-scope

  1. Apache Spark Internals
  2. Databricks CLI
  3. Databricks REST API
  4. Change Data Capture
  5. Data modeling concepts
  6. Notebooks and Job permissions
  7. Personally Identifiable Information (PII)
  8. GDPR / CCPA
  9. Monitoring and logging production jobs
  10. Dependency management
  11. Testing

Data Lakehouse Platform (24%)

  • Lakehouse: Lakehouse description, Lakehouse Benefits to Data Teams
    • Data Lakehouse vs Data Warehouse
    • Data Lakehouse vs Data Lake
    • Data Quality Improvements
  • Data Science and Engineering Workspace: Clusters, DBFS, Notebooks, Repos
    • High level architecture
    • Key components of a workspace deployment
    • Core services in a Databricks workspace deployment
  • Delta Lake: General Concepts, Table Management, Table Manipulation, Optimisations
    • Organisational data problems resolved with Lakehouse
    • Benefits to different roles in a data team

Databricks Lakehouse Concepts

  • Lakehouse Concepts
    • Data Lakehouse vs Data Warehouse
    • Data Lakehouse vs Data Lake
    • Data Quality Improvements
  • Platform Architecture
    • High level architecture and key components of a workspace deployment
    • Core services in a Databricks workspace deployment
  • Benefits to Data Teams
    • Organisational data problems solved with Lakehouse
    • Benefits to different roles in a data team

Data Science and Engineering Workspace

  • Clusters
    • All-purpose clusters vs job clusters
    • Cluster instances and pools
  • Databricks File System (DBFS)
    • Managing permissions on tables
    • Role permissions and functions
    • Data Explorer
  • Notebooks
    • Features and limitations
    • Collaboration best practices
  • Repos
    • Supported features and Git operations
    • Relevance in CI/CD workflows in Databricks

Delta Lake

  • General Concepts
    • ACID transactions on a data lake
    • Features and benefits of Delta Lake
  • Table Management & Manipulation
    • Creating tables
    • Managing files
    • Writing to tables
    • Dropping tables
  • Optimisations
    • Supported features and benefits
    • Table utilities to manage files
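
As a quick refresher on the items above, here is a minimal PySpark sketch of creating, writing to, optimising, and dropping a Delta table; the table name and columns are invented for illustration, not taken from the exam.

    from pyspark.sql import SparkSession

    # On Databricks the notebook already provides `spark`; getOrCreate() keeps the sketch self-contained.
    spark = SparkSession.builder.getOrCreate()

    # Create a managed table (Delta is the default table format on Databricks).
    spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE, region STRING)")

    # Write to the table and inspect its transaction history (the ACID log).
    spark.sql("INSERT INTO sales VALUES (1, 10.5, 'EMEA')")
    spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

    # Table utilities / optimisations: compact small files, co-locate data on a column,
    # and remove files no longer referenced by the transaction log.
    spark.sql("OPTIMIZE sales ZORDER BY (region)")
    spark.sql("VACUUM sales")

    # Drop the table when it is no longer needed.
    spark.sql("DROP TABLE sales")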

Self-assessment for Data Lakehouse Platform

Objectives | Options
Identify how the data lakehouse solves a common organisational data problem | Very under-prepared, Somewhat under-prepared, Prepared
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared
Identify limitations in Databricks Notebooks version control functionality relative to Repos | Very under-prepared, Somewhat under-prepared, Prepared
Identify why Z-ordering is beneficial to Delta Lake tables | Very under-prepared, Somewhat under-prepared, Prepared

ELT with Spark SQL and Python (29%)

Build ETL pipelines using Apache Spark SQL and Python

  • Relational entities (databases, tables, views)
  • ELT
    • creating tables
    • writing data to tables
    • transforming data
    • UDFs
  • Manipulating data with Spark SQL and Python

Relational Entities

Leverage Spark SQL DDL to create and manipulate relational entities on Databricks

  • Databases
    • Create databases in specific locations
    • Retrieve locations of existing databases
    • Modify and delete databases
  • Tables
    • Managed vs external tables
    • Create and drop managed and external tables
    • Query and modify managed and external tables
  • Views and CTEs
    • Views vs Temporary views
    • Views vs Delta Lake tables
    • Creating views and CTEs
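
The DDL below is a hedged sketch of these topics, assuming the `spark` session a Databricks notebook provides; the database, path, table, and view names are hypothetical.

    # Create a database in a specific location and retrieve that location.
    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db LOCATION '/mnt/demo/demo_db'")
    spark.sql("DESCRIBE DATABASE demo_db").show(truncate=False)

    # Managed table (files live under the database location) vs external table (explicit LOCATION).
    spark.sql("CREATE TABLE demo_db.managed_tbl (id INT, name STRING)")
    spark.sql("CREATE TABLE demo_db.external_tbl (id INT, name STRING) LOCATION '/mnt/demo/external_tbl'")

    # Temporary view (session-scoped) vs view (persisted in the metastore), plus a CTE in a query.
    spark.sql("CREATE OR REPLACE TEMP VIEW big_ids AS SELECT * FROM demo_db.managed_tbl WHERE id > 100")
    spark.sql("CREATE OR REPLACE VIEW demo_db.names_view AS SELECT id, name FROM demo_db.managed_tbl")
    spark.sql("""
        WITH ids AS (SELECT id FROM demo_db.managed_tbl)
        SELECT count(*) AS n FROM ids
    """).show()

    # Modify and delete databases.
    spark.sql("ALTER DATABASE demo_db SET DBPROPERTIES ('owner' = 'data-eng')")
    spark.sql("DROP DATABASE demo_db CASCADE")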

ELT: Extract & Load Data into Delta Lake

Use Spark SQL to extract, load, and transform data to support production workloads and analytics in the Lakehouse

  • Creating tables
    • External sources vs Delta Lake tables
    • Methods to create tables and use cases
    • Delta table configurations
    • Different file formats and data sources
    • CREATE TABLE AS SELECT (CTAS) statements
  • Writing Data to Tables
    • Methods to write to tables and use cases
    • Efficiency for different operations
    • Resulting behaviours in target tables
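
A hedged sketch of the extract-and-load patterns above; the paths, table names, and the `updates` source are hypothetical, and `spark` is the notebook-provided session.

    # Register a table directly against external JSON files (schema inferred from the files).
    spark.sql("CREATE TABLE IF NOT EXISTS raw_events USING JSON LOCATION '/mnt/raw/events'")

    # CTAS: create a Delta table from a SELECT, with the schema taken from the query.
    spark.sql("CREATE OR REPLACE TABLE events AS SELECT * FROM json.`/mnt/raw/events`")

    # Different write methods and their effect on the target table.
    spark.sql("INSERT INTO events SELECT * FROM json.`/mnt/raw/events_new`")        # append rows
    spark.sql("INSERT OVERWRITE events SELECT * FROM json.`/mnt/raw/events_full`")  # replace contents
    spark.sql("""
        MERGE INTO events AS t
        USING updates AS u ON t.id = u.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)                                                                            # upsert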

ELT: Use Spark SQL to Transform Data

  • Cleaning Data with Common SQL
    • Methods to deduplicate data
    • Common cleaning methods for different types of data
  • Combining Data
    • Join types and strategies
    • Set operators and applications
  • Reshaping Data
    • Different operations to transform arrays
    • Benefits of array functions
    • Applying higher-order functions
  • Advanced Operations
    • Manipulating nested data fields
    • Applying SQL UDFs for custom transformations
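
These transformations read roughly as follows in Spark SQL; a rough sketch, with all table and column names (orders, customers, items, profile, and so on) invented for illustration.

    # Deduplicate rows.
    spark.sql("CREATE OR REPLACE TEMP VIEW orders_dedup AS SELECT DISTINCT * FROM orders")

    # Join types and set operators.
    spark.sql("SELECT o.order_id, c.name FROM orders o INNER JOIN customers c ON o.customer_id = c.id")
    spark.sql("SELECT order_id FROM orders_2023 UNION SELECT order_id FROM orders_2024")

    # Arrays: explode into rows, and a higher-order function applied to the array.
    spark.sql("SELECT order_id, explode(items) AS item FROM orders")
    spark.sql("SELECT order_id, filter(items, i -> i.price > 10) AS pricey_items FROM orders")

    # Nested fields are addressed with dot syntax.
    spark.sql("SELECT profile.address.city FROM customers")

    # SQL UDF for a custom transformation.
    spark.sql("CREATE OR REPLACE FUNCTION to_gbp(amount DOUBLE) RETURNS DOUBLE RETURN amount * 0.79")
    spark.sql("SELECT order_id, to_gbp(total) AS total_gbp FROM orders")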

Just Enough Python

Leverage PySpark for advanced code functionality needed in production applications

  • Spark SQL Queries
    • Executing Spark SQL Queries in PySpark
  • Passing Data to SQL
    • Temporary views
    • Converting tables to and from DataFrames
  • Python Syntax
    • Functions and variables
    • Control flow
    • Error handling
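
A sketch of the Python-side glue, assuming the notebook-provided `spark` session and a hypothetical `events` table:

    def row_count(table_name):
        """Run a Spark SQL query from Python and return a plain Python value."""
        try:
            return spark.sql(f"SELECT count(*) AS n FROM {table_name}").first()["n"]
        except Exception as err:                       # basic error handling
            print(f"Query failed for {table_name}: {err}")
            return 0

    # Pass data between SQL and DataFrames via temporary views.
    events_df = spark.table("events")                  # table -> DataFrame
    events_df.createOrReplaceTempView("events_view")   # DataFrame -> SQL-addressable view
    summary_df = spark.sql("SELECT date, count(*) AS n FROM events_view GROUP BY date")

    # Simple control flow driven by a query result.
    if row_count("events") > 0:
        summary_df.write.mode("overwrite").saveAsTable("events_summary")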

Self-assessment for ELT with Spark SQL and Python

Objectives | Options
Load and parse nested data from JSON files into a Delta table | Very under-prepared, Somewhat under-prepared, Prepared
Describe the most efficient write operation for a specific task | Very under-prepared, Somewhat under-prepared, Prepared
Explode and flatten arrays in a dataset | Very under-prepared, Somewhat under-prepared, Prepared
Apply a SQL UDF to complete a custom task | Very under-prepared, Somewhat under-prepared, Prepared

Incremental Data Processing (22%)

  • Structured Streaming (general concepts, triggers, watermarks)
  • Auto Loader (streaming reads)
  • Multi-hop Architecture (bronze-silver-gold, streaming applications)
  • Delta Live Tables (benefits and features)

Structured Streaming

  • General Concepts
    • Programming model
    • Configuration for reads and writes
    • End-to-end fault tolerance
    • Interacting with streaming queries
  • Triggers
    • Set up streaming writes with different trigger behaviours
  • Watermarks
    • Unsupported operations on streaming data
    • Scenarios in which watermarking data would be necessary
    • Managing state with watermarking
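
A minimal Structured Streaming sketch covering a read, a watermark-bounded aggregation, and a triggered, checkpointed write; the tables, paths, and columns are hypothetical.

    from pyspark.sql import functions as F

    stream_df = spark.readStream.table("bronze_events")

    # The watermark bounds state so late data older than 10 minutes can be dropped.
    counts_df = (stream_df
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"))
        .count())

    # Checkpointing provides end-to-end fault tolerance; the trigger controls batch cadence.
    query = (counts_df.writeStream
        .outputMode("append")
        .option("checkpointLocation", "/mnt/checkpoints/event_counts")
        .trigger(once=True)                 # or processingTime="1 minute" for repeating micro-batches
        .toTable("event_counts"))

    query.awaitTermination()                # one way of interacting with the running query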

Auto Loader

Incrementally process data to power analytics insights with Spark Structured Streaming and Auto Loader

  • Define streaming reads with Auto Loader and PySpark to load data into Delta
  • Define streaming reads on tables for SQL manipulation
  • Identifying source locations
  • Use cases for using Auto Loader
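
A hedged Auto Loader sketch; the landing path, schema location, and table names are hypothetical.

    # Incremental, schema-inferring read of newly arriving JSON files.
    raw_df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/raw_events/schema")
        .load("/mnt/landing/events"))

    # Register the stream as a temp view so it can be manipulated with SQL.
    raw_df.createOrReplaceTempView("raw_events_stream")

    # Land the stream in a Delta table.
    (raw_df.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/raw_events")
        .trigger(once=True)
        .toTable("bronze_events"))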

Multi-hop Architecture

Propagate new data through multiple tables in the data lakehouse

  • Bronze
    • Bronze vs raw tables
    • Workloads using bronze tables as source
  • Silver & Gold
    • Silver vs gold tables
    • Workloads using silver tables as source
  • Structured Streaming in Multi-hop
    • Converting data from bronze to silver levels with validations
    • Converting data from silver to gold levels with aggregations
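
A sketch of the two streaming hops, with validation on the way to silver and aggregation on the way to gold; all table, path, and column names are hypothetical.

    from pyspark.sql import functions as F

    # Bronze -> silver: basic validation and type casting while streaming.
    silver_query = (spark.readStream.table("bronze_events")
        .filter(F.col("event_id").isNotNull())
        .withColumn("event_time", F.col("event_time").cast("timestamp"))
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/silver_events")
        .toTable("silver_events"))

    # Silver -> gold: business-level aggregation.
    gold_query = (spark.readStream.table("silver_events")
        .withWatermark("event_time", "1 hour")
        .groupBy("region", F.window("event_time", "1 hour"))
        .count()
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/mnt/checkpoints/gold_event_counts")
        .toTable("gold_event_counts"))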

Delta Live Tables

  • General Concepts
    • Benefits of using Delta Live Tables for ETL
    • Scenarios that benefit from Delta Live Tables
  • UI
    • Deploying DLT pipelines from notebooks
    • Executing updates
    • Explore and evaluate results from DLT pipelines
  • SQL Syntax
    • Converting SQL definitions to Auto Loader syntax
    • Common differences in DLT SQL syntax
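
DLT pipelines can be declared in SQL or Python; below is a rough Python sketch (it only runs inside a DLT pipeline, and the source path, table names, and expectation are made up).

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw events ingested incrementally with Auto Loader.")
    def bronze_events():
        return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/events"))

    @dlt.table(comment="Validated events.")
    @dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")   # declarative data-quality expectation
    def silver_events():
        return (dlt.read_stream("bronze_events")
            .withColumn("event_time", F.col("event_time").cast("timestamp")))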

Self-assessment for Incremental Data Processing

Objectives | Options
Set up a Structured Streaming write with specified configurations | Very under-prepared, Somewhat under-prepared, Prepared
Describe the use of bronze, silver, or gold tables for different workloads | Very under-prepared, Somewhat under-prepared, Prepared
Deploy a DLT pipeline from an existing notebook | Very under-prepared, Somewhat under-prepared, Prepared

Production Pipelines (16%)

Building production pipelines for data engineering applications and Databricks SQL queries and dashboards, including

  • Workflows (Job scheduling, task orchestration, UI)
  • Dashboards (endpoints, scheduling, alerting, refreshing)

Workflows

  • Automation
    • Setting up retry policies
    • Using cluster pools and why
  • Task Orchestration
    • Benefits of using multiple tasks in a Job
    • Configuring predecessor tasks
  • UI
    • Using notebook parameters in jobs
    • Locating job failures using Jobs UI

Dashboards

  • Databricks SQL Endpoints
    • Creating SQL endpoints for different use cases
  • Query Scheduling
    • Scheduling query based on scenarios
    • Rerunning queries on a set interval
  • Alerting
    • Configure notifications for different conditions
    • Configure and manage alerts for failure
  • Refreshing
    • Scheduling dashboard refreshes
    • Impact of query reruns on dashboard performance

Self-assessment for Production Pipelines

Objectives | Options
Configure an alert in case of values not meeting a condition | Very under-prepared, Somewhat under-prepared, Prepared
Describe causes for slow dashboard performance | Very under-prepared, Somewhat under-prepared, Prepared

Data Governance (9%)

  • Unity Catalog (benefits and features)
  • Entity Permissions (team-based permissions, user-based permissions)

Unity Catalog

  • Benefits of Unity Catalog
  • Unity Catalog Features

Entity Permissions

  • Configuring access to production tables and databases
  • Granting different levels of permissions to users and groups
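
A hedged sketch of the GRANT/REVOKE statements involved, run here through `spark.sql`; the database, table, and group names are hypothetical.

    # Give a group read access to a production database and one of its tables.
    spark.sql("GRANT USAGE ON DATABASE prod_db TO `data_analysts`")
    spark.sql("GRANT SELECT ON TABLE prod_db.sales TO `data_analysts`")

    # Give another group full access, review current grants, and revoke when needed.
    spark.sql("GRANT ALL PRIVILEGES ON TABLE prod_db.sales TO `data_engineers`")
    spark.sql("SHOW GRANTS ON TABLE prod_db.sales").show(truncate=False)
    spark.sql("REVOKE SELECT ON TABLE prod_db.sales FROM `data_analysts`")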

Self-assessment for Data Governance

Objectives | Options
Describe how Unity Catalog handles security | Very under-prepared, Somewhat under-prepared, Prepared
Assign full access to a production table for a specific group | Very under-prepared, Somewhat under-prepared, Prepared