Delta Lake Table
You should default to non-partitioned tables for most use cases when working with Delta Lake.
- Most Delta Lake tables, especially those holding small-to-medium-sized data, will not benefit from partitioning.
- Because partitioning physically separates data files, it can create a small-files problem and hinder both file compaction and efficient data skipping.
- The benefits observed in Hive or HDFS do not translate to Delta Lake; consult an experienced Delta Lake architect before partitioning tables.
Table cloning is a great way to set up tables for testing SQL code while it is still in development.
DEEP CLONE. Copies table metadata + data files.
- Rerunning the command will sync changes from the source to the target location.
SHALLOW CLONE. Copies only the table metadata.
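A clone is created (or re-synced, by rerunning the same statement) like this; the `prod`/`dev` table names below are placeholders for illustration:

```sql
-- Deep clone: copies metadata and data files; rerun to sync new changes.
CREATE OR REPLACE TABLE dev.sales_clone
DEEP CLONE prod.sales;

-- Shallow clone: copies only metadata; data files remain in the source table.
CREATE OR REPLACE TABLE dev.sales_shallow
SHALLOW CLONE prod.sales;
```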
We could use
CREATE OR REPLACE TABLE ... AS SELECT (CRAS) statements to completely overwrite Delta Lake tables. Compared with deleting and recreating tables, this has multiple benefits:
- Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete all files.
- The old version of the table still exists, and you can easily retrieve the old data using Time Travel.
- It’s an atomic operation, meaning concurrent queries can still read the table while you are overwriting it.
- Due to ACID transaction guarantees, if overwriting the table fails, the table will automatically roll back to its previous state.
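A complete overwrite with CRAS looks like the following sketch; the table name and source path are placeholders:

```sql
-- Atomically replaces the table's contents; the old version stays
-- reachable via Time Travel (e.g. VERSION AS OF).
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`/raw/events/`;
```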
INSERT OVERWRITE provides a nearly identical outcome to a complete overwrite, but is a safer technique for overwriting an existing table, because:
- It can only overwrite an existing table, not create a new one as a CRAS statement can.
- It can only overwrite with new records that match the current table schema, which helps prevent disrupting downstream consumers.
- It can overwrite individual partitions.
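A minimal sketch, with a placeholder table and source path:

```sql
-- Fails if the target table doesn't exist or the incoming schema differs,
-- which protects downstream consumers.
INSERT OVERWRITE events
SELECT * FROM parquet.`/raw/events/`;
```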
Delta Lake can improve the speed of read queries from a table. One way to improve this speed is by compacting small files into larger ones; you trigger compaction by running the OPTIMIZE command.
Another way is to use the VACUUM command to clean up stale data files that are no longer referenced by the table. By default, VACUUM will prevent you from deleting files less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted. But you can configure Delta Lake to allow deleting files less than 7 days old by disabling the retention duration check.
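For example (the retention-check flag is the standard Delta knob; retaining 0 hours is shown purely for illustration and is risky on tables with concurrent readers):

```sql
OPTIMIZE events;                       -- compact small files into larger ones
OPTIMIZE events ZORDER BY (user_id);   -- optionally co-locate data for skipping

-- Disable the 7-day safety check, then delete unreferenced files immediately.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM events RETAIN 0 HOURS;
```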
Delta Live Tables
Create a Silver Table from a Bronze Table
In Delta Live Tables, you can stream data from other tables in the same pipeline by using the
STREAM() function. In this case, we must define a streaming live table using a
CREATE STREAMING LIVE TABLE statement. Remember: to query another
LIVE table, we must always prepend the
LIVE keyword to the table name.
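A silver table streaming from a bronze table in the same pipeline might look like this sketch; the table and column names are placeholders:

```sql
-- STREAM() reads the source incrementally; LIVE. references a table
-- defined in the same pipeline.
CREATE OR REFRESH STREAMING LIVE TABLE silver_orders AS
SELECT order_id, CAST(order_ts AS TIMESTAMP) AS order_ts
FROM STREAM(LIVE.bronze_orders);
```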
Multiple Data Source
If we have scenarios where we need to ingest data from multiple tables, we could continue to use the
MERGE INTO operation, but tweak it into an insert-only merge for deduplication.
This optimised command uses the same
MERGE INTO syntax but provides only a
WHEN NOT MATCHED clause.
Below we use this to confirm that records with the same
event_timestamp aren’t already in the target table.
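A sketch of such an insert-only merge; the `events`/`new_events` tables and the `user_id`/`event_timestamp` join keys are illustrative:

```sql
-- Only inserts rows whose keys don't already exist in the target;
-- no WHEN MATCHED clause means existing rows are never updated.
MERGE INTO events a
USING new_events b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED THEN
  INSERT *
```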
Single Data Source
The COPY INTO command is a good friend for incrementally ingesting data from an external system. It’s an idempotent option and potentially much cheaper than reprocessing everything, because files that have already been loaded are skipped on subsequent runs.
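A typical invocation, with a placeholder target table, path, and format:

```sql
-- Idempotent: files already loaded into `sales` are skipped on reruns.
COPY INTO sales
FROM '/landing/sales/'
FILEFORMAT = PARQUET;
```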
GLOBAL TEMP VIEW
Sometimes a relational object must be used and shared with other data engineers in other sessions on the same cluster. To save on storage costs by avoiding copying and storing physical data, the best option is a GLOBAL TEMP VIEW in Databricks.
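Global temp views are registered in the `global_temp` database and are visible to every session on the same cluster; the view and table names below are placeholders:

```sql
CREATE GLOBAL TEMP VIEW recent_sales AS
SELECT * FROM sales WHERE sale_date >= current_date() - INTERVAL 7 DAYS;

-- Other sessions on the same cluster query it via the global_temp schema:
SELECT * FROM global_temp.recent_sales;
```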
CREATE TABLE USING
When we use a
CREATE TABLE statement, we can also specify a particular data source with the USING clause. The following are all the data sources supported by Databricks.
The following additional file formats to use for the table are supported in the Databricks Runtime:
- a fully-qualified class name of a custom implementation of
The default option is
DELTA. For any
data_source other than
DELTA you must also specify a
LOCATION unless the table catalog is
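For example, registering an external CSV table; the path and options are placeholders:

```sql
-- Non-DELTA source, so a LOCATION is required (external table).
CREATE TABLE raw_customers
USING CSV
OPTIONS (header = 'true', delimiter = ',')
LOCATION '/landing/customers/';
```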
Auto Loader VS COPY INTO
There are several things to consider when choosing between Auto Loader and COPY INTO command:
- Volume of data. If you are going to ingest files in the order of thousands, you can use the COPY INTO command. If you are expecting files in the order of millions or more over time, use Auto Loader.
- If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and evolution.
- Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s harder to reprocess a select subset of files. However, you can use COPY INTO to reload a subset of files while an Auto Loader stream is running simultaneously.
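One way to use Auto Loader from SQL is inside a Delta Live Tables pipeline via the cloud_files() function; this is a sketch, and the path, format, and table name are placeholders:

```sql
-- Auto Loader incrementally discovers and ingests new files as they arrive.
CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders AS
SELECT * FROM cloud_files('/landing/orders/', 'json');
```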