Data Engineering with Databricks v2  [draft]

00 - General Databricks documentation, link Data Engineering With Databricks, GitHub 01 - Databricks Workspace and Services Databricks Architecture and Services Databricks Control Plane Web Application Databricks SQL Databricks Machine Learning Databricks Data Science and Engineering Repos / Notebooks Job Scheduling Cluster Management Cluster Clusters are made up of one or more virtual machine (VM) instances. Driver node: coordinates the activities of the executors, aka the master node in EMR. Executor node: runs the tasks composing a Spark job, aka the core node in EMR....
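The driver/executor split described above can be pictured as a coordinator fanning work out to workers. A rough local analogy (plain Python, not the Databricks or Spark API) using a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # each "executor" independently processes one partition of the data
    return sum(partition)

# the "driver" splits the job into tasks and hands them to executors
partitions = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, partitions))

# the driver collects and aggregates the partial results
total = sum(results)
print(total)  # -> 21
```

In real Spark, the driver additionally builds the execution plan and tracks task state; this sketch only illustrates the coordinate-then-aggregate shape of the relationship.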

March 7, 2023 · 30 min · 6268 words · Eric

Exam Guide - Databricks Certified Data Engineer Associate

General Databricks Certified Data Engineer Associate: link. Time allotted to complete the exam is 1.5 hours (90 minutes). Exam fee: $200 USD. Number of questions: 45. Passing score is at least 70% on the overall exam. Code examples: data manipulation code will be in SQL when possible; Structured Streaming code will be in Python; runtime version is DBR 10.4 LTS. Practice Exam: link. Expectation: Databricks Lakehouse Platform (24%). Understand how to use, and the benefits of using, the Databricks Lakehouse Platform and its tools....
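Given the numbers above (45 questions, 70% passing score), the minimum number of correct answers follows from a quick calculation:

```python
import math

questions = 45
passing_score = 0.70  # at least 70% overall

# 45 * 0.70 = 31.5, so 32 correct answers are needed to clear the threshold
min_correct = math.ceil(questions * passing_score)
print(min_correct)  # -> 32
```

So up to 13 questions can be missed while still passing.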

February 20, 2023 · 7 min · 1294 words · Eric

Databricks Learning Path

Learning Path What is Databricks ( link ) What is Databricks Lakehouse ( link ) What are the ACID guarantees on Databricks ( link ) What is the medallion lakehouse architecture ( link ) Databricks Architecture ( link ) Launching a Databricks all-purpose compute cluster ( link ) Creating a Databricks notebook ( link ) Executing notebook cells to process, query, and preview data ( link ) Create, run, and manage Databricks Jobs ( link ) Configuring incremental data ingestion to Delta Lake ( link ) Scheduling a notebook as a Databricks job ( link ) Databricks SQL ( link )

February 8, 2023 · 1 min · 101 words · Eric

Databricks Glossary

A ACL Access Control List (ACL). Auto Compaction Auto Compaction is part of the Auto Optimize feature in Databricks. After an individual write, it checks whether files can be compacted further; if so, it runs an OPTIMIZE job targeting 128 MB file sizes instead of the 1 GB target used by the standard OPTIMIZE. In short, Auto Compaction uses 128 MB for compacting files, while the OPTIMIZE command uses 1 GB. Auto Loader Auto Loader monitors a source location, in which files accumulate, to identify and ingest only newly arriving files with each command run....
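A minimal sketch of an Auto Loader read, assuming a JSON landing directory and a writable schema location (both paths are hypothetical). This is a Databricks-only configuration fragment and will not run outside a Databricks runtime:

```python
# Hedged sketch: Auto Loader incremental ingestion (Databricks runtime only).
# "/mnt/raw/events" and "/mnt/schemas/events" are hypothetical paths.
df = (
    spark.readStream
    .format("cloudFiles")                 # "cloudFiles" is the Auto Loader source
    .option("cloudFiles.format", "json")  # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
    .load("/mnt/raw/events")              # only newly arrived files are picked up per run
)
```

Auto Loader tracks which files it has already seen, so repeated runs ingest only the new arrivals rather than re-reading the whole directory.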

February 1, 2023 · 5 min · 966 words · Eric

dbt Fundamentals  [draft]

0 - General dbt Fundamentals dbt Developer Hub 1 - Who is an Analytics Engineer? Traditional Data Teams Data Engineers Data Engineers are in charge of building the infrastructure that data is hosted on, usually databases. Data Engineers also manage the ETL process to ensure data is where it needs to be, and in tables for Data Analysts to query. The skill set for Data Engineers includes SQL, Python, Java, and other programming languages....

December 2, 2022 · 3 min · 488 words · Eric