01 - Modeling Data Management Solutions

Bronze Ingestion Patterns

Ingestion Patterns

  • Singleplex: One-to-One mapping of source datasets to bronze tables.
  • Multiplex: Many-to-One mapping, i.e. many datasets are mapped to one bronze table.

Singleplex is the traditional ingestion model where each data source or topic is ingested separately into its own bronze table. Singleplex usually works well for batch processing.

However, for stream processing of large numbers of datasets, running one streaming job per topic will hit the maximum limit of concurrent jobs in your workspace.

Multiplex, on the other hand, combines many topics and streams them into a single bronze table. In this model, we typically use a pub/sub system such as Kafka as the source, but we can also use files in cloud object storage with Auto Loader.

In the multiplex pattern, records are organised by a topic column, alongside a value column that contains the actual data as JSON. Later in the pipeline, the multiplex bronze table is filtered on the topic column to create the silver layer tables, as sketched below.
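A minimal PySpark sketch of the pattern; the Kafka bootstrap server, topic name, checkpoint paths, and schema below are all hypothetical:

from pyspark.sql import functions as F

# Ingest all topics into a single multiplex bronze table.
(spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-server:9092")  # hypothetical
  .option("subscribePattern", ".*")                        # subscribe to all topics
  .load()
  .select("topic", F.col("value").cast("string"), "timestamp")
  .writeStream
  .option("checkpointLocation", "/checkpoints/bronze")     # hypothetical
  .toTable("bronze"))

# Downstream: filter on the topic column to build a silver table.
orders_schema = "order_id BIGINT, price DOUBLE"            # hypothetical schema
(spark.readStream
  .table("bronze")
  .filter("topic = 'orders'")
  .select(F.from_json(F.col("value"), orders_schema).alias("v"))
  .select("v.*")
  .writeStream
  .option("checkpointLocation", "/checkpoints/orders_silver")
  .toTable("orders_silver"))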

Quality Enforcement

Check constraints can be added to Delta tables to enforce data quality; writes that violate a constraint are rejected. For example (the constraint name and condition are illustrative):

ALTER TABLE orders_silver ADD CONSTRAINT valid_date CHECK (order_date > '2020-01-01')

Slowly Changing Dimensions (SCD)

Change Data Capture

Delta Lake CDF

Change Data Feed (CDF) is a feature built into Delta Lake that automatically generates CDC feeds for Delta Lake tables.

CDF records row-level changes for all the data written into a Delta table. These include the row data along with metadata indicating whether the specified row was inserted, deleted, or updated (exposed as the _change_type, _commit_version, and _commit_timestamp columns).

CDF is used to propagate incremental changes to downstream tables in a multi-hop architecture.
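A minimal sketch of this propagation in PySpark; the table names, key column, and columns in the MERGE are hypothetical:

from delta.tables import DeltaTable

# Stream the row-level changes recorded by CDF, starting from version 0.
changes = (spark.readStream
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .table("customers_silver"))

# Apply each micro-batch of changes to the downstream table with MERGE
# (assumes at most one change per key in each batch).
def upsert_to_gold(batch_df, batch_id):
    src = batch_df.filter("_change_type != 'update_preimage'")
    (DeltaTable.forName(spark, "customers_gold").alias("t")
      .merge(src.alias("s"), "t.customer_id = s.customer_id")
      .whenMatchedDelete(condition="s._change_type = 'delete'")
      .whenMatchedUpdate(set={"email": "s.email"})
      .whenNotMatchedInsert(
          condition="s._change_type != 'delete'",
          values={"customer_id": "s.customer_id", "email": "s.email"})
      .execute())

(changes.writeStream
  .foreachBatch(upsert_to_gold)
  .option("checkpointLocation", "/checkpoints/customers_gold")  # hypothetical
  .start())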

Query the change data

To query the change data, we can use the following SQL command:

SELECT *
FROM table_changes('table_name', start_version, [end_version])

Or

SELECT *
FROM table_changes('table_name', start_timestamp, [end_timestamp])

Enable CDF

CDF is not enabled by default; you can enable it with the following commands.

New tables

CREATE TABLE my_table (id INT, name STRING)
TBLPROPERTIES (delta.enableChangeDataFeed = true)

Existing table

ALTER TABLE my_table
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

You can also enable CDF by default for all new tables created in the session with the following Spark configuration in your notebook:

spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", True)

CDF Retention

  • CDF data follows the retention policy of the table.
  • When running VACUUM, CDF data is also deleted.

CDF Usage Scenario

  • Use CDF when
    • Generally speaking, we use CDF to send incremental data changes to downstream tables in a multi-hop architecture. These changes should include updates and deletes.
    • Only a small fraction of records is updated in each batch. Such updates are usually received from external sources in CDC format.
  • Don’t use CDF when
    • The changes are append-only. In that case there is no need for CDF; we can stream the changes directly from the table, as sketched after this list.
    • Most of the records in the table are updated, or the table is completely overwritten, in each batch.
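For the append-only case, a plain streaming read is enough; a minimal sketch with hypothetical table names:

# Append-only changes can be streamed directly from the Delta table;
# no CDF needed.
(spark.readStream
  .table("orders_bronze")
  .writeStream
  .option("checkpointLocation", "/checkpoints/orders_gold")
  .toTable("orders_gold"))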

Stream-Stream Joins

Stream-Static Joins

Materialised Gold Tables

02 - Databricks Tooling (20%, 12/60)

Databricks Workflow

RESTful API

Databricks CLI

The Databricks CLI is simply a Python wrapper around the RESTful API above.

To install the Databricks CLI, you need to run

pip install databricks-cli

To configure your Databricks CLI, you need to run

databricks configure --token

Once you set up your local environment, you can try CLI commands such as databricks clusters list or databricks workspace ls /Users.
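Since the CLI only wraps the REST API, you can also call the same endpoints directly. A minimal sketch using Python's requests library against the clusters/list endpoint; the host and token values are placeholders:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

# Equivalent to `databricks clusters list` in the CLI.
resp = requests.get(
    f"{HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"])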

03 - Data Processing (30%, 18/60)

04 - Data Modeling (20%, 12/60)

05 - Security & Governance (10%, 6/60)

Propagating Deletes

Delete requests, also known as requests to be forgotten, require deleting user data that represents Personally Identifiable Information (PII), such as the user's name and email.
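A minimal sketch of propagating such a delete to a downstream table, assuming CDF is enabled on the silver table; the table names, key value, and starting version are hypothetical:

# Delete the user's PII from the silver table.
spark.sql("DELETE FROM customers_silver WHERE customer_id = 42")

# With CDF enabled, the delete appears in the change feed and can be
# replayed against downstream tables.
(spark.read
  .option("readChangeFeed", "true")
  .option("startingVersion", 5)        # hypothetical: version of the delete
  .table("customers_silver")
  .filter("_change_type = 'delete'")
  .createOrReplaceTempView("silver_deletes"))

spark.sql("""
  MERGE INTO customers_gold g
  USING silver_deletes d
  ON g.customer_id = d.customer_id
  WHEN MATCHED THEN DELETE
""")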

Dynamic Views

Dynamic views allow access control lists (ACLs) to be applied to data in a table at the column or row level. Users with sufficient privileges can see all fields, while restricted users are shown redacted results, as defined at view creation.

Column-level control

In the following example, only members of the admin_demo group have permission to view PII; other members only see redacted results.

CREATE OR REPLACE VIEW customers_vw AS
  SELECT
    customer_id,
    CASE
      WHEN is_member('admin_demo') THEN email
      ELSE 'REDACTED'
    END AS email,
    gender,
    CASE
      WHEN is_member('admin_demo') THEN first_name
      ELSE 'REDACTED'
    END AS first_name,
    CASE
      WHEN is_member('admin_demo') THEN last_name
      ELSE 'REDACTED'
    END AS last_name,
    CASE
      WHEN is_member('admin_demo') THEN street
      ELSE 'REDACTED'
    END AS street,
    city,
    country,
    row_time
  FROM customers_silver

Users in the admin_demo group see the original PII values, while other users see 'REDACTED'.

Row-level control

For row-level control, we can add a WHERE clause to filter the source data on different conditions. Note that we can create views on top of other views; permissions are inherited from the underlying view.

CREATE OR REPLACE VIEW customers_fr_vw AS
SELECT * FROM customers_vw
WHERE
  CASE
    WHEN is_member('admin_demo') THEN TRUE
    ELSE country = 'France' AND row_time > '2022-01-01'
  END

For a user who is not in the admin_demo group, the view returns only rows where country = 'France' and row_time > '2022-01-01'.

06 - Monitoring & Logging (10%, 6/60)

07 - Testing & Deployment (10%, 6/60)

Data Pipeline Testing

  • Data quality tests: test the quality of the data itself
    • e.g. You can test that a column price has only values greater than zero.
    • Apply check constraints to Delta tables
  • Standard tests: test the code logic
    • Unit testing
    • Integration testing
    • End-to-end testing

Unit Testing

  • Approach to testing individual units of code, such as functions
  • If you make any changes to them in the future, you can determine whether they still work as you expect them to
  • This helps you find problems with your code faster and earlier in the development lifecycle.
  • Use assertions to perform unit tests.
    • An assertion is a statement that enables you to test the assumptions you have made in your code.
assert func() == expected_value
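A minimal sketch with a hypothetical function under test:

# A hypothetical unit under test.
def apply_discount(price: float, pct: float) -> float:
    """Return the price after applying a percentage discount."""
    return round(price * (1 - pct / 100), 2)

# Each assertion encodes an assumption about the function's behaviour.
def test_apply_discount():
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(19.99, 0) == 19.99

test_apply_discount()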

Integration Testing

  • Approach to testing the interaction between subsystems of an application.
  • Software modules are integrated logically and tested as a group.

End-to-end Testing

  • Approach to ensure that your application can run properly under real-world scenarios
  • Simulate a user experience from start to finish
