Databricks Best Practices  [draft]

Table Partitioning: You should default to non-partitioned tables for most use cases when working with Delta Lake. Most Delta Lake tables, especially those holding small-to-medium sized data, will not benefit from partitioning. Because partitioning physically separates data files, this approach can result in a small files problem and prevent both file compaction and efficient data skipping. The benefits observed in Hive or HDFS do not translate to Delta Lake, and you should consult with an experienced Delta Lake architect before partitioning tables....
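As a rough illustration of the small files problem mentioned in this excerpt, the sketch below works through the arithmetic of splitting a fixed amount of data across many partitions. The table size, partition count, and the helper name `avg_file_size_mb` are all assumed values chosen for illustration, not figures from the post.

```python
# Illustrative arithmetic only: every size and count below is an assumed value.

def avg_file_size_mb(table_size_gb: float, num_partitions: int,
                     files_per_partition: int = 1) -> float:
    """Average data file size in MB when a table is spread evenly over partitions."""
    total_files = num_partitions * files_per_partition
    return table_size_gb * 1024 / total_files


# Assumption: a 10 GB table written as 10 files, with no partitioning.
unpartitioned = avg_file_size_mb(10, num_partitions=1, files_per_partition=10)

# Assumption: the same 10 GB partitioned by day over three years,
# with at least one file per partition directory.
by_day = avg_file_size_mb(10, num_partitions=365 * 3)

print(f"unpartitioned: {unpartitioned:.0f} MB per file")
print(f"daily partitions: {by_day:.1f} MB per file")
```

Because each partition keeps its files in a separate directory, files in different partitions cannot be merged with each other, so under these assumptions the per-file size stays small however often the table is compacted.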

March 16, 2023 · 9 min · 1803 words · Eric

Delta Lake Whitepaper  [draft]

Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Apache Parquet format, along with transaction logs in JSON format. Reference: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, by Michael Armbrust (@databricks), Ali Ghodsi (@databricks, @uc berkeley), Reynold Xin (@databricks), Matei Zaharia (@databricks, @stanford). Michael Armbrust: Boston Spark Meetup @ Wayfair / Delta Lake: Open Source Reliability and Quality for Data Lakes. Delta Lake Inside. YouTube: Understanding Delta File Logs - The Heart of the Delta Lake
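To make the layout described in this excerpt concrete, here is a minimal sketch that writes a simplified commit file into a `_delta_log` directory and reads back which Parquet files it adds. The sample commit content, the file names, and the helpers `write_sample_table` and `added_files` are hypothetical. Real commit files carry more actions and fields than shown here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily simplified commit content (assumed for illustration).
SAMPLE_COMMIT = [
    {"commitInfo": {"operation": "WRITE"}},
    {"add": {"path": "part-00000.snappy.parquet", "size": 1024, "dataChange": True}},
    {"add": {"path": "part-00001.snappy.parquet", "size": 2048, "dataChange": True}},
]


def write_sample_table(root: Path) -> Path:
    """Create a _delta_log directory holding one newline-delimited JSON commit."""
    log_dir = root / "_delta_log"
    log_dir.mkdir(parents=True)
    commit = log_dir / f"{0:020d}.json"  # commit versions are zero-padded to 20 digits
    commit.write_text("\n".join(json.dumps(action) for action in SAMPLE_COMMIT))
    return commit


def added_files(commit_file: Path) -> list[str]:
    """Return the data file paths referenced by 'add' actions in one commit."""
    paths = []
    for line in commit_file.read_text().splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            paths.append(action["add"]["path"])
    return paths


with tempfile.TemporaryDirectory() as tmp:
    commit = write_sample_table(Path(tmp))
    files = added_files(commit)
    print(commit.name, files)
```

The sketch only parses the JSON log. Reading the Parquet data files themselves would need a Parquet reader or a Delta Lake client, which is left out here.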

March 12, 2023 · 1 min · 94 words · Eric

Useful Databricks References  [draft]

Documentation: Azure Databricks documentation (link); Databricks documentation (link); Data objects in Databricks (link); SQL Language Reference (link); Databricks alphabetical list of built-in functions (link); Data Engineering With Databricks (GitHub). Tools: Databricks Utilities (link); Unzip dbc files (GitHub). Tutorials: Data Engineering With Databricks (GitHub); Databricks Training tutorials; Databricks: Databricks Training (YouTube); Databricks on the AWS Cloud Quick Start Reference Deployment; dbt: Configure Databricks for dbt Cloud

March 8, 2023 · 1 min · 65 words · Eric

Data Engineering with Databricks v2  [draft]

00 - General: Databricks documentation (link); Data Engineering With Databricks (GitHub). 01 - Databricks Workspace and Services: Databricks Architecture and Services. Databricks Control Plane: Web Application; Databricks SQL; Databricks Machine Learning; Databricks Data Science and Engineering; Repos / Notebooks; Job Scheduling; Cluster Management. Cluster: Clusters are made up of one or more virtual machine (VM) instances. Driver node: coordinates the activities of executors, aka master node in EMR. Executor node: runs the tasks composing a Spark job, aka run node in EMR....

March 7, 2023 · 30 min · 6268 words · Eric

Exam Guide - Databricks Certified Data Engineer Associate

General: Databricks Certified Data Engineer Associate: link. Time allotted to complete the exam is 1.5 hours (90 minutes). Exam fee: $200 USD. Number of questions: 45. Passing score is at least 70% on the overall exam. Code: Example data manipulation code will be in SQL when possible. Structured Streaming code will be in Python. Runtime version is DBR 10.4 LTS. Practice Exam: link. Expectation: Databricks Lakehouse Platform (24%). Understand how to use and the benefits of using the Databricks Lakehouse Platform and its tools....

February 20, 2023 · 7 min · 1294 words · Eric