databricks

Machine Learning with Databricks^[draft]

Databricks Data Engineering Professional Preparation^[draft]

01 - Modeling Data Management Solutions

Bronze Ingestion Patterns

Ingestion Patterns

Singleplex: a one-to-one mapping of source datasets to bronze tables.
Multiplex: a many-to-one mapping, i.e. many datasets are mapped to one bronze table.

Singleplex is the traditional ingestion model, where each data source or topic is ingested separately. It usually works well for batch processing. For stream processing of large datasets, however, running many streaming jobs, one per topic, means you will hit the maximum limit of concurrent jobs in your workspace....
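The difference between the two patterns can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the records, the `topic`/`key`/`value` field names, the `bronze_<topic>` table naming, and the helper functions are all hypothetical and are not taken from the post or from any Databricks API.

```python
from collections import defaultdict

# Hypothetical raw records arriving from three source topics.
raw_records = [
    {"topic": "orders", "key": "o-1", "value": '{"order_id": 1}'},
    {"topic": "customers", "key": "c-1", "value": '{"customer_id": 7}'},
    {"topic": "orders", "key": "o-2", "value": '{"order_id": 2}'},
    {"topic": "books", "key": "b-1", "value": '{"book_id": 3}'},
]


def ingest_singleplex(records):
    """One-to-one: each topic is written to its own bronze table."""
    tables = defaultdict(list)
    for record in records:
        tables[f"bronze_{record['topic']}"].append(record)
    return dict(tables)


def ingest_multiplex(records):
    """Many-to-one: all topics land in a single bronze table.

    The topic is kept as a column so downstream readers can filter on it.
    """
    return {"bronze": [dict(record) for record in records]}


def read_topic(bronze_table, topic):
    """Downstream step: pull one topic back out of the multiplex table."""
    return [record for record in bronze_table if record["topic"] == topic]


singleplex = ingest_singleplex(raw_records)
multiplex = ingest_multiplex(raw_records)

print(sorted(singleplex))                            # one table per topic
print(sorted(multiplex))                             # a single shared table
print(len(read_topic(multiplex["bronze"], "orders")))
```

In the singleplex version the number of tables (and, in a streaming setting, the number of jobs) grows with the number of topics. In the multiplex version a single ingestion path feeds one table, and per-topic separation is deferred to a filter downstream.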

Exam Guide - Databricks Certified Data Engineer Professional

General
Databricks Certified Data Engineer Professional: link
Time allotted to complete the exam: 2 hours (120 minutes)
Exam fee: $200 USD
Number of questions: 60
Question type: multiple choice
Passing score: at least 70% on the overall exam

Code Examples
Data manipulation code will be in SQL when possible.
Structured Streaming code will be in Python.
Runtime version is DBR 10.4 LTS.
Practice Exam: link

Target Audience
Data engineers with two or more years of experience.
Advanced, practitioner-level certification.
Assesses candidates at a level equivalent to two or more years of data engineering with Databricks.

Expectation
Understanding of the Databricks platform and developer tools.
Ability to build optimised and cleaned data processing pipelines using the Spark and Delta Lake APIs.
Ability to model data into a Lakehouse using knowledge of general data modeling concepts.
Ability to make data pipelines secure, reliable, monitored, and tested before deployment.

Out of Scope
The following is not expected of a Professional-level data engineer:...

Databricks Best Practices^[draft]

Table Partitioning

You should default to non-partitioned tables for most use cases when working with Delta Lake. Most Delta Lake tables, especially those holding small-to-medium sized data, will not benefit from partitioning. Because partitioning physically separates data files, it can result in a small files problem and prevent both file compaction and efficient data skipping. The benefits observed in Hive or HDFS do not translate to Delta Lake, and you should consult with an experienced Delta Lake architect before partitioning tables....

Delta Lake Whitepaper^[draft]

Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Apache Parquet format, along with transaction logs in JSON format.

Reference
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. Michael Armbrust (@databricks), Ali Ghodsi (@databricks, @uc berkeley), Reynold Xin (@databricks), Matei Zaharia (@databricks, @stanford)
Michael Armbrust: Boston Spark Meetup @ Wayfair / Delta Lake: Open Source Reliability and Quality for Data Lakes
Delta Lake Inside
YouTube: Understanding Delta File Logs - The Heart of the Delta Lake
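To make the storage layout described above (Parquet data files plus a JSON transaction log) more concrete, here is a minimal sketch of how a reader could replay the log to find which data files currently belong to a table. It assumes a simplified log in which each commit is newline-delimited JSON containing `add` and `remove` actions with a `path` field. Real log entries carry more fields, and Parquet checkpoints are ignored here. The file names and the `active_files` helper are hypothetical.

```python
import json


def active_files(commits):
    """Replay commits in version order and return the set of live data files.

    `commits` is a list of strings, one per commit file, each holding
    newline-delimited JSON actions (a simplified view of the log).
    """
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
            # Other actions (e.g. commitInfo) do not change the file set.
    return files


# Hypothetical commit 0: an initial write of two Parquet files.
commit_0 = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
])

# Hypothetical commit 1: one file is rewritten (removed and replaced).
commit_1 = "\n".join([
    json.dumps({"commitInfo": {"operation": "UPDATE"}}),
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
])

current = active_files([commit_0, commit_1])
print(sorted(current))  # ['part-0001.parquet', 'part-0002.parquet']
```

Passing only `[commit_0]` returns the file set as of the first version, which is the basic idea behind reading an earlier snapshot of the table from the same log.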