Delta Lake Table

Table Partitioning

You should default to non-partitioned tables for most Delta Lake use cases.

  • Most Delta Lake tables, especially those holding small-to-medium sized data, do not benefit from partitioning.
  • Because partitioning physically separates data files, it can create a small-files problem that undermines file compaction and efficient data skipping.
  • The partitioning benefits observed in Hive or on HDFS do not translate to Delta Lake, so you should consult an experienced Delta Lake architect before partitioning tables.

Table Cloning

Table cloning is a great way to set up tables for testing SQL code while still in development.

  • DEEP CLONE. Copies table metadata and data files.
    • Rerunning the command syncs changes from the source to the target location.
  • SHALLOW CLONE. Copies only the table metadata; data files are still read from the source table.
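As a sketch, both clone types share the same syntax (the dev_sales and prod.sales table names are illustrative):

```sql
-- Full, independent copy; rerun to sync new changes from the source
CREATE OR REPLACE TABLE dev_sales
DEEP CLONE prod.sales;

-- Metadata-only copy; cheap to create, data files stay with the source
CREATE OR REPLACE TABLE dev_sales_shallow
SHALLOW CLONE prod.sales;
```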

Table Overwriting

We can use CREATE OR REPLACE TABLE ... AS SELECT (CRAS) statements to completely overwrite Delta Lake tables. Compared with deleting and recreating tables, this has multiple benefits:

  1. Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
  2. The old version of the table still exists, and you can easily retrieve the old data using Time Travel.
  3. It’s an atomic operation, meaning concurrent queries can still read the table while you are overwriting it.
  4. Due to ACID transaction guarantees, if overwriting the table fails, the table automatically rolls back to its previous state.
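A minimal sketch of a complete overwrite (the sales table and source path are hypothetical):

```sql
-- Atomically replaces the table; the old version stays queryable
CREATE OR REPLACE TABLE sales AS
SELECT * FROM parquet.`/path/to/raw/sales`;

-- Retrieve the previous state via Time Travel if needed
SELECT * FROM sales VERSION AS OF 1;
```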

INSERT OVERWRITE produces a nearly identical outcome to a complete overwrite, but it is a safer technique for overwriting an existing table, because:

  1. It can only overwrite an existing table, not create a new one like a CRAS statement can.
  2. It can only overwrite with new records that match the current table schema, which helps prevent disrupting downstream consumers.
  3. It can overwrite individual partitions.
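The behaviour above can be sketched as follows (table and source names are illustrative):

```sql
-- Replaces the contents of an existing table;
-- fails if the new records do not match the current table schema
INSERT OVERWRITE sales
SELECT * FROM sales_staging;
```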

Performance Optimisation


Delta Lake can improve the speed of read queries against a table. One way to improve this speed is by compacting small files into larger ones. You trigger compaction by running the OPTIMIZE command.
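For example (the events table and the ZORDER column are illustrative; the optional ZORDER BY clause co-locates related data in the same files to improve data skipping):

```sql
-- Compact small files, clustering rows by event_date
OPTIMIZE events
ZORDER BY (event_date);
```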


Another way is to use the VACUUM command to clean up stale files. By default, VACUUM prevents you from deleting files less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted. You can, however, disable this retention-duration check if you need to delete files younger than 7 days.
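As a sketch, the retention check can be disabled and a shorter retention window supplied explicitly (use with care; the events table name is illustrative):

```sql
-- Disable the safety check that enforces the 7-day default
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove all stale files no longer referenced by the current table version
VACUUM events RETAIN 0 HOURS;
```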

Delta Live Tables

Create a Silver Table from a Bronze Table

In Delta Live Tables, you can stream data from other tables in the same pipeline by using the STREAM() function. In this case, we must define a streaming live table using a CREATE STREAMING LIVE TABLE statement. Remember that to query another live table, we must always prefix its name with the LIVE keyword.

CREATE STREAMING LIVE TABLE sales_silver
AS SELECT
  store_id, total + tax AS total_after_tax
FROM STREAM(LIVE.sales_bronze)

Data Ingestion

Multiple Data Sources

If we have scenarios where we need to ingest data from multiple tables, we can continue to use the MERGE INTO operation, but tweak it into an insert-only merge for deduplication.

This optimised command uses the same MERGE INTO syntax but provides only a WHEN NOT MATCHED clause.

Below we use this to confirm that records with the same user_id and event_timestamp aren’t already in the events table.

MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN
  INSERT *

Single Data Source

The COPY INTO command is a good friend for incrementally ingesting data from an external system. It is an idempotent option: files that have already been loaded are skipped, so rerunning the command does not duplicate data.
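A minimal sketch (the table name, path, and file format are illustrative):

```sql
-- Incrementally load new files; previously ingested files are skipped
COPY INTO sales
FROM '/path/to/landing/sales'
FILEFORMAT = JSON;
```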



If a relational object must be used and shared with other data engineers in other sessions on the same cluster, and you want to save on storage cost by avoiding copying and storing physical data, the best option is a GLOBAL TEMP VIEW in Databricks.
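As a sketch (view and table names are illustrative), global temporary views are registered in the global_temp schema and are visible to other sessions on the same cluster:

```sql
-- No physical data is copied; the view is just a stored query
CREATE GLOBAL TEMP VIEW sales_summary AS
SELECT store_id, SUM(total) AS total_sales
FROM sales
GROUP BY store_id;

-- Other sessions on the same cluster query it via the global_temp schema
SELECT * FROM global_temp.sales_summary;
```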


When we use a CREATE TABLE statement, we can also specify a particular data source. The following data sources are supported by Databricks:

  1. TEXT
  2. AVRO
  3. BINARYFILE
  4. CSV
  5. JSON
  6. PARQUET
  7. ORC
  8. DELTA

The following additional formats are also supported for tables in the Databricks Runtime:

  • JDBC
  • a fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister

The default option is DELTA. For any data_source other than DELTA you must also specify a LOCATION unless the table catalog is hive_metastore.
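For example, a table backed by a non-Delta source must specify its location (the table name, options, and path are illustrative):

```sql
-- External table over existing CSV files; no data is copied
CREATE TABLE sales_csv
USING CSV
OPTIONS (header = 'true', delimiter = ',')
LOCATION '/path/to/csv/files';
```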

Auto Loader vs COPY INTO

There are several things to consider when choosing between Auto Loader and the COPY INTO command:

  1. Volume of data. If you are going to ingest files on the order of thousands, you can use the COPY INTO command. If you are expecting files on the order of millions or more over time, use Auto Loader.
  2. If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and evolution.
  3. Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s harder to reprocess a select subset of files. However, you can use COPY INTO to reload the subset of files while an Auto Loader stream is running simultaneously.
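In a Delta Live Tables pipeline, Auto Loader can be invoked from SQL with the cloud_files() function (the table name and landing path are illustrative):

```sql
-- Incrementally ingest new JSON files as they arrive in the landing path
CREATE OR REFRESH STREAMING LIVE TABLE sales_raw
AS SELECT *
FROM cloud_files('/path/to/landing/sales', 'json');
```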