Data Engineering Project Template  [draft]

I will use it to explain some of the fundamentals we have been discussing and eventually bring them to life in a tutorial series. I will also extend the template with the missing MLOps parts, so tune in! Recap: Data Producers - Python applications that extract data from chosen Data Sources and push it to the Collector via REST or gRPC API calls. Collector - REST or gRPC server written in Python that takes a payload (JSON or protobuf), validates the existence and correctness of the top-level fields, adds additional metadata, and pushes the data into either the Raw Events Topic if validation passes or a Dead Letter Queue if the top-level fields are invalid....
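
The Collector's validate-enrich-route step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the required field names (`event_type`, `schema_version`, `payload`), the metadata keys, and the topic names are hypothetical, since the summary does not list them, and the function returns the chosen destination instead of publishing to a real broker.

```python
import time
import uuid
from typing import Any

# Hypothetical top-level schema: the post summary does not name the actual fields.
REQUIRED_FIELDS: dict[str, type] = {
    "event_type": str,
    "schema_version": str,
    "payload": dict,
}

RAW_EVENTS_TOPIC = "raw_events"          # assumed topic name
DEAD_LETTER_QUEUE = "dead_letter_queue"  # assumed topic name


def validate_top_level(event: Any) -> list[str]:
    """Return a list of problems with the top-level fields (empty if valid)."""
    if not isinstance(event, dict):
        return ["event is not a JSON object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    return errors


def route_event(event: Any) -> tuple[str, dict]:
    """Validate, attach collector metadata, and pick the destination topic."""
    errors = validate_top_level(event)
    enriched = {
        "collector_id": str(uuid.uuid4()),  # assumed metadata: unique id per event
        "collected_at": time.time(),        # assumed metadata: receive timestamp
        "event": event,
    }
    if errors:
        # Invalid top-level fields: keep the reasons alongside the original event.
        enriched["errors"] = errors
        return DEAD_LETTER_QUEUE, enriched
    return RAW_EVENTS_TOPIC, enriched
```

Wrapping the original event instead of mutating it keeps the producer's payload intact, so a record in the Dead Letter Queue can be inspected or replayed as it was received.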

January 7, 2021 · 2 min · 382 words · Eric

DataOps  [draft]

What is DataOps? DataOps is a methodology that combines technology, processes, principles, and personnel to automate data orchestration throughout an organization. Data Platform Design: Data Model (Kimball Model); Data File Format Comparison (Apache Parquet, Avro, ORC, and Arrow); Open Table Formats (Delta Table, Apache Iceberg, Hudi, and Hive). Data Governance & Management: Data Lifecycle Management; Data Discovery & Curation; Data Management & Quality; Data Lineage; Data Quality (Uber: "Data Quality at Uber - How to get data right at Uber scale"; DataQualityPro: "Creating a Data Quality Firewall and Data Quality SLA"; ScenSoft: "Your Guide to Data Quality Management"); Master Data Management. Technical Capability: Data Architecture; Code Packaging; Integration Test; Monitoring & Alerting; Release Management.

January 7, 2021 · 1 min · 114 words · Eric

Uninstall Anaconda on macOS

Sometimes you need to reconfigure your local Anaconda environment and have to uninstall the Anaconda distribution completely. Automatic Uninstallation. Step 1: Install the anaconda-clean package with conda install anaconda-clean. Step 2: Clean your environment. The anaconda-clean command removes all Anaconda-related files and directories, with a confirmation prompt before deleting each one. The --yes argument skips every confirmation prompt and removes all these files and directories without asking....

October 10, 2020 · 1 min · 178 words · Eric

Pants  [draft]

Reference: Pants official documentation; Getting started with Pants

March 22, 2019 · 1 min · 8 words · Eric

Table Lock Issues in PostgreSQL

Situation: Recently we needed to lift and shift some datasets from AWS Redshift to AWS Aurora on a daily basis. Intuitively, I thought this process would be very straightforward, because both Redshift and Aurora are essentially Postgres variants, and we could use all the Postgres tooling (e.g., pg_dump, pg_restore, COPY, etc.) to transfer the data. But in reality, nothing is hard until you start to implement it and write the actual code to do the work....

March 6, 2019 · 6 min · 1097 words · Eric