Useful Databricks References  [draft]

Documentation: Azure Databricks documentation (link); Databricks documentation (link); Data objects in Databricks (link); SQL Language Reference (link); Databricks Alphabetical list of built-in functions (link); Data Engineering With Databricks (GitHub). Tools: Databricks Utilities (link); Unzip dbc files (GitHub). Tutorials: Data Engineering With Databricks (GitHub); Databricks Training tutorials; Databricks: Databricks Training (YouTube); Databricks on the AWS Cloud Quick Start Reference Deployment; dbt: Configure Databricks for dbt Cloud.

March 8, 2023 · 1 min · 65 words · Eric

Data Engineering with Databricks v2  [draft]

00 - General: Databricks documentation (link); Data Engineering With Databricks (GitHub). 01 - Databricks Workspace and Services: Databricks Architecture and Services. Databricks Control Plane: Web Application; Databricks SQL; Databricks Machine Learning; Databricks Data Science and Engineering; Repos / Notebooks; Job Scheduling; Cluster Management. Cluster: Clusters are made up of one or more virtual machine (VM) instances. Driver node: coordinates the activities of the executors, aka master node in EMR. Executor node: runs the tasks composing a Spark job, comparable to the core and task nodes in EMR....

March 7, 2023 · 30 min · 6268 words · Eric

Exam Guide - Databricks Certified Data Engineer Associate

General: Databricks Certified Data Engineer Associate: link. Time allotted to complete the exam is 1.5 hours (90 minutes). Exam fee is $200 USD. Number of questions: 45. The passing score is at least 70% on the overall exam. Code Example: data manipulation code will be in SQL when possible; Structured Streaming code will be in Python; the runtime version is DBR 10.4 LTS. Practice Exam: link. Expectation: Databricks Lakehouse Platform (24%). Understand how to use and the benefits of using the Databricks Lakehouse Platform and its tools....

February 20, 2023 · 7 min · 1294 words · Eric
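
The exam figures in the summary above (45 questions, a pass mark of at least 70%, 90 minutes) can be turned into rough study targets. The sketch below is only an illustration: the function names are hypothetical, and it assumes every question is scored and weighted equally, which the guide summary does not state.

```python
# Figures quoted in the exam guide summary.
TOTAL_QUESTIONS = 45
PASSING_PERCENT = 70
TIME_ALLOTTED_MINUTES = 90


def min_correct_answers(total_questions: int, passing_percent: int) -> int:
    """Smallest whole number of correct answers that reaches the pass mark.

    Assumption (not from the guide): all questions count equally.
    """
    # Integer ceiling of total_questions * passing_percent / 100,
    # done without floats to avoid rounding surprises.
    return (total_questions * passing_percent + 99) // 100


def minutes_per_question(total_minutes: int, total_questions: int) -> float:
    """Average time budget per question."""
    return total_minutes / total_questions


if __name__ == "__main__":
    print(min_correct_answers(TOTAL_QUESTIONS, PASSING_PERCENT))
    print(minutes_per_question(TIME_ALLOTTED_MINUTES, TOTAL_QUESTIONS))
```

Under that equal-weighting assumption, 70% of 45 is 31.5, so at least 32 correct answers are needed, with an average budget of 2 minutes per question.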

Databricks Lakehouse Fundamentals

What is a Data Lakehouse? History of Data Warehouse. Pros: Business Intelligence (BI); Analytics; Structured & Clean Data; Predefined Schemas. Cons: No support for semi-structured or unstructured data; Inflexible schemas; Struggled with volume and velocity upticks; Long processing time. History of Data Lake. Pros: Flexible data storage; Structured, semi-structured, and unstructured data; Streaming support; Cost efficient in the cloud; Support for AI and Machine Learning. Cons: No transactional support. Poor data reliability: data lakes do not support transactional data and cannot enforce data quality, primarily because they hold multiple data types. Slow analysis performance: because of the large volume of data, analysis is slower, and the promised timeliness of decision-making results has never materialized. Data governance concerns: governing the data in a data lake creates challenges for security and privacy enforcement because of the unstructured nature of the lake's contents. Data Warehouse still needed. Problems with Complex Data Environment: data lakes didn't fully replace data warehouses for reliable BI insights, so businesses have implemented complex systems that combine a data lake, a data warehouse, and additional systems to handle streaming data, machine learning, and artificial intelligence requirements....

February 8, 2023 · 7 min · 1327 words · Eric

Databricks Learning Path

Learning Path: What is Databricks (link); What is Databricks Lakehouse (link); What are the ACID guarantees on Databricks (link); What is the medallion Lakehouse architecture (link); Databricks Architecture (link); Launching a Databricks all-purpose compute cluster (link); Creating a Databricks notebook (link); Executing notebook cells to process, query, and preview data (link); Create, run, and manage Databricks Jobs (link); Configuring incremental data ingestion to Delta Lake (link); Scheduling a notebook as a Databricks job (link); Databricks SQL (link).

February 8, 2023 · 1 min · 101 words · Eric