A
ACL
Access Control List (ACL).
Auto Compaction
Auto Compaction is part of the Auto Optimise feature in Databricks. After an individual write, it checks whether files can be compacted further; if so, it runs an OPTIMISE job targeting 128MB files instead of the 1GB file size used by the standard OPTIMISE command.
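Auto compaction can also be switched on for every write in a Spark session rather than per table. A minimal sketch, assuming a Databricks runtime where this session configuration is available (the per-table property is shown under Auto Optimise below):

```python
# Enable auto compaction for all Delta writes in the current session
# (Databricks-specific configuration; assumed available in the runtime).
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```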
Auto Loader
Auto Loader monitors a source location in which files accumulate, identifying and ingesting only newly arrived files on each run; files that were already ingested in previous runs are skipped.
Auto Loader incrementally and idempotently processes new data files as they arrive in cloud storage and loads them into a target Delta Lake table.
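A minimal sketch of an Auto Loader stream reading JSON files into a Delta table; the paths, file format, and table name are placeholders, not values from this glossary:

```python
# Incrementally ingest newly arrived files from a cloud storage folder.
(spark.readStream
    .format("cloudFiles")                                        # Auto Loader source
    .option("cloudFiles.format", "json")                         # hypothetical source format
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")  # hypothetical schema-tracking path
    .load("/mnt/landing/orders")                                 # hypothetical landing folder
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")     # hypothetical checkpoint path
    .trigger(availableNow=True)                                  # process what is available, then stop
    .toTable("orders_bronze"))                                   # hypothetical target Delta table
```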
Auto Optimise
Auto Optimise is a feature that automatically compacts the small data files of Delta tables. This happens during individual writes to the Delta table.
Auto Optimise consists of two complementary operations:
- Optimised writes. With this feature enabled, Databricks attempts to write out 128MB files for each table partition.
- Auto compaction. After an individual write, this checks whether files can be compacted further and, if so, runs an OPTIMISE job with a 128MB target file size (instead of the 1GB file size used by the standard OPTIMISE). Auto compaction does not support Z-Ordering, as Z-Ordering is significantly more expensive than compaction alone.
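Both operations can be enabled per table through Delta table properties. A minimal sketch, assuming an existing table; the table name is hypothetical:

```python
# Enable optimised writes and auto compaction on one table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```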
Autotune File Size
To minimise the need for manual tuning, Databricks automatically tunes the file size of Delta tables based on the size of the table. Databricks will use smaller file sizes for smaller tables and larger file sizes for larger tables, so that the number of files in the table does not grow too large. Databricks does not autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.
The target file size is based on the current size of the Delta table:
- For tables smaller than 2.56TB, the autotuned target file size is 256MB.
- For tables between 2.56TB and 10TB, the target size grows linearly from 256MB to 1GB.
- For tables larger than 10TB, the target file size is 1GB.
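As mentioned above, setting an explicit target size switches autotuning off for that table. A minimal sketch, assuming the delta.targetFileSize table property is available in the runtime; the table name and size are hypothetical:

```python
# Pin the target file size for one table; autotuning no longer applies to it.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')")
```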
B
C
Cache
Change Data Feed
Change Data Feed (CDF) is a feature built into Delta Lake that allows it to automatically generate CDC feeds for Delta Lake tables.
CDF records row-level changes for all the data written into a Delta table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.
Databricks supports reading a table's changes captured by CDF in streaming queries using spark.readStream. This allows you to get only the new changes captured since the last time the streaming query ran.
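A minimal sketch of turning CDF on for a table and streaming its change feed; the table name and starting version are hypothetical:

```python
# Enable the change data feed on an existing table.
spark.sql("ALTER TABLE orders SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')")

# Stream only the changes recorded since a given table version.
changes = (spark.readStream
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("orders"))
# Each change row carries _change_type, _commit_version and _commit_timestamp columns.
```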
Clone
A clone is a copy of a Delta table; see Deep Clone and Shallow Clone.
D
Deep Clone
A Deep Clone is a clone that copies the source table's data to the clone target in addition to the metadata of the existing table. Stream metadata is also cloned, so that a stream writing to the Delta table can be stopped on the source table and continued on the clone target from where it left off.
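A minimal sketch of creating a deep clone; both table names are hypothetical:

```python
# Copy the data and metadata of "orders" into a new table.
spark.sql("CREATE TABLE orders_backup DEEP CLONE orders")
```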
Delta Cache
See Disk Cache; Delta cache is its former name.
Delta Log
The Delta Log (also called the transaction log) is the ordered record of every transaction performed on a Delta table, stored as commit files in the table's _delta_log directory.
Disk Cache
Disk cache, previously known as Delta cache, is designed to enhance query performance by storing data on disk, allowing for accelerated data reads. Data is automatically cached when files are fetched, utilizing a fast intermediate format.
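A minimal sketch of enabling the disk cache for the current session, assuming a Databricks runtime where this configuration is available:

```python
# Cache remote data files on local disk as they are read.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```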
E
F
G
Global Init Script
H
I
J
K
L
M
MERGE INTO
N
O
Optimise
Delta Lake can improve the speed of read queries from a table. One way to improve this speed is by compacting small files into larger ones. You trigger compaction by running the OPTIMISE command.
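A minimal sketch of compacting a table; the table name is hypothetical, and note that the SQL keyword is spelled OPTIMIZE:

```python
# Compact the table's small files into larger ones.
spark.sql("OPTIMIZE orders")
```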
P
Q
R
S
Schema Evolution
Service Principal
An identity used by automated tools, jobs, and applications rather than by a person. Think of it like an AWS IAM role.
Shallow Clone
A Shallow Clone is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create.
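A minimal sketch of creating a shallow clone; both table names are hypothetical:

```python
# Copy only the table metadata; data files are still referenced from the source table.
spark.sql("CREATE TABLE orders_dev SHALLOW CLONE orders")
```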
SQL Endpoint
It’s just a different name for a SQL warehouse, the compute resource that runs SQL workloads in Databricks SQL.
SQL-OPTIMISE
Stream-Static Join
Stream-Static joins take advantage of the Delta Lake guarantee that the latest version of the static Delta table is returned each time it is queried in a join operation with a data stream.
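A minimal sketch of a stream-static join; the table and column names are hypothetical:

```python
# The streaming side drives the query; the static Delta table is read fresh for each micro-batch.
orders_stream = spark.readStream.table("orders_bronze")
customers = spark.read.table("customers")
enriched = orders_stream.join(customers, on="customer_id", how="inner")
```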
Stream-Stream Join
When performing a stream-stream join, Spark buffers past inputs as streaming state for both input streams, so that it can match every future input with past inputs. This state can be bounded by using watermarks.
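A minimal sketch of a stream-stream join whose state is bounded by watermarks and a time-range condition; the source tables, column names, and time bounds are hypothetical:

```python
from pyspark.sql.functions import expr

# Watermarks tell Spark how late data can arrive on each side.
impressions = (spark.readStream.table("impressions_bronze")
    .withWatermark("impression_time", "2 hours"))
clicks = (spark.readStream.table("clicks_bronze")
    .withWatermark("click_time", "3 hours"))

# The time-range condition lets Spark drop buffered state once no future match is possible.
joined = impressions.join(
    clicks,
    expr("""
        click_ad_id = impression_ad_id AND
        click_time BETWEEN impression_time AND impression_time + INTERVAL 1 HOUR
    """))
```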
T
Table History
Each operation that modifies a Delta Lake table creates a new table version.
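A minimal sketch of inspecting a table's history; the table name is hypothetical:

```python
# List the table's versions together with the operations that created them.
spark.sql("DESCRIBE HISTORY orders").show()
```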
Transaction Log
Time Travel
Delta Lake Time Travel supports querying previous table versions based on a timestamp or a table version (as recorded in the transaction log). You can use time travel for applications such as:
- Re-creating analyses, reports, or outputs. This can be useful for debugging or auditing, especially in regulated industries, e.g. banking, insurance, and government.
- Writing complex temporal queries.
- Fixing mistakes in your data.
- Providing snapshot isolation for a set of queries on fast-changing tables.
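A minimal sketch of both forms of time travel query; the table name, version, and timestamp are hypothetical:

```python
# Query the table as of a specific version and as of a point in time.
v5 = spark.sql("SELECT * FROM orders VERSION AS OF 5")
yesterday = spark.sql("SELECT * FROM orders TIMESTAMP AS OF '2024-01-01'")
```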
U
UPSERT
An UPSERT (update + insert) is performed in Delta Lake with the MERGE INTO command.
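A minimal sketch of an upsert using MERGE INTO; the table and column names are hypothetical:

```python
# Update matching rows and insert the rest from a source of new and changed records.
spark.sql("""
    MERGE INTO customers AS t
    USING customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```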
V
Version
Each operation that modifies a Delta table creates a new table version; see Table History and Time Travel.
W
X
Y
Z
Z-Ordering
Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by the data-skipping algorithms in Delta Lake on Databricks, which dramatically reduces the amount of data that needs to be read.
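A minimal sketch of Z-Ordering a table during compaction; the table and column names are hypothetical, and the SQL keyword is spelled OPTIMIZE:

```python
# Compact the table while co-locating rows with similar customer_id values.
spark.sql("OPTIMIZE orders ZORDER BY (customer_id)")
```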