Databricks Lakehouse Fundamentals

What is a Data Lakehouse?

History of the Data Warehouse
Pros:
- Business Intelligence (BI) and analytics
- Structured, clean data
- Predefined schemas
Cons:
- No support for semi-structured or unstructured data
- Inflexible schemas
- Struggled with upticks in data volume and velocity
- Long processing times

History of the Data Lake
Pros:
- Flexible storage for structured, semi-structured, and unstructured data
- Streaming support
- Cost efficient in the cloud
- Support for AI and machine learning
Cons:
- No transactional support and poor data reliability: data lakes cannot enforce data quality, primarily because they hold many different data types
- Slow analysis performance: the sheer volume of data slows analysis, so results often arrive too late to inform decisions
- Data governance concerns: the unstructured nature of a data lake's contents makes security and privacy enforcement difficult
- A data warehouse is still needed

Problems with a Complex Data Environment
Because the data lake never fully replaced the data warehouse for reliable BI insights, businesses built complex environments combining a data lake, a data warehouse, and additional systems to handle streaming data, machine learning, and artificial intelligence requirements....

February 8, 2023 · 7 min · 1327 words · Eric

Databricks Learning Path

Learning Path
- What is Databricks ( link )
- What is the Databricks Lakehouse ( link )
- What are the ACID guarantees on Databricks ( link )
- What is the medallion lakehouse architecture ( link )
- Databricks architecture ( link )
- Launching a Databricks all-purpose compute cluster ( link )
- Creating a Databricks notebook ( link )
- Executing notebook cells to process, query, and preview data ( link )
- Create, run, and manage Databricks Jobs ( link )
- Configuring incremental data ingestion to Delta Lake ( link )
- Scheduling a notebook as a Databricks job ( link )
- Databricks SQL ( link )

February 8, 2023 · 1 min · 101 words · Eric

Databricks Glossary

A

ACL
Access Control List (ACL).

Auto Compaction
Auto Compaction is part of the Auto Optimize feature in Databricks. After an individual write, it checks whether files can be compacted further; if so, it runs an OPTIMIZE job targeting 128 MB files instead of the 1 GB target used by the standard OPTIMIZE command.

Auto Loader
Auto Loader monitors a source location in which files accumulate, identifying and ingesting only newly arrived files with each run....
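Auto Loader's core behavior, ingesting only files it has not seen before, can be sketched as a toy in plain Python. This is purely illustrative: the real feature is invoked via `spark.readStream.format("cloudFiles")` and persists its discovery state in a checkpoint location, while `new_files` and the file names below are hypothetical.

```python
def new_files(listing, seen):
    """Return files in `listing` that have not been processed yet,
    mimicking Auto Loader's incremental file discovery (toy model;
    the real feature tracks processed files in a checkpoint)."""
    return sorted(f for f in listing if f not in seen)

# First run: the source directory holds two files, both new.
seen = set()
batch1 = new_files(["a.json", "b.json"], seen)   # ["a.json", "b.json"]
seen.update(batch1)

# Second run: one more file has arrived; only it is ingested.
batch2 = new_files(["a.json", "b.json", "c.json"], seen)  # ["c.json"]
```

Each run picks up only the delta of newly arrived files, which is what makes repeated ingestion runs idempotent and cheap.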

February 1, 2023 · 5 min · 966 words · Eric

Learn In Public

Reference Learn in Public

December 29, 2022 · 1 min · 4 words · Eric

Convert HEIC Images to JPEG with macOS

The referenced post walks you through creating an Automator workflow that converts HEIC images to JPEG. Reference How to Create a Mac Quick Action to Convert HEIC to JPG

December 21, 2022 · 1 min · 30 words · Eric