Table Lock Issues in PostgreSQL

Situation: Recently we needed to lift and shift some datasets from AWS Redshift to AWS Aurora on a daily basis. Intuitively, I thought this process would be very straightforward, because both Redshift and Aurora are Postgres variants, and we could use the usual Postgres tooling (e.g., pg_dump, pg_restore, COPY) to transfer the data. But in reality, nothing seems hard until you start to implement it and write the actual code to do the work....

March 6, 2019 · 6 min · 1097 words · Eric

Pandas Best Practice

Data Manipulation: Dedup DataFrame. Sometimes we want to drop all the duplicated rows in a DataFrame, and we can use the drop_duplicates() function. For example:

    import pandas as pd

    df = pd.DataFrame({
        'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
        'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
        'rating': [4, 4, 3.5, 15, 5]
    })
    df

         brand style  rating
    0  Yum Yum   cup     4.0
    1  Yum Yum   cup     4....
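Since the excerpt cuts off before drop_duplicates() is actually called, here is a minimal sketch of how the call might look on the same sample data. The variable names `deduped` and `by_brand` are illustrative assumptions, not taken from the post, and the `subset`/`keep` variant is just one option that may or may not be what the full post goes on to show.

```python
import pandas as pd

# Same sample data as in the excerpt.
df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# Drop rows that are identical across every column.
# Row 1 repeats row 0, so it is removed; the original index labels are kept.
deduped = df.drop_duplicates()

# Hypothetical variant: treat rows as duplicates when only "brand" matches,
# keeping the first row seen for each brand.
by_brand = df.drop_duplicates(subset=["brand"], keep="first")

print(deduped)
print(by_brand)
```

drop_duplicates() returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned to a name if you want to keep it.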

March 1, 2019 · 2 min · 330 words · Eric

PostgreSQL Best Practice

Table Creation: Add essential field-checking rules when creating a table.

    CREATE TABLE IF NOT EXISTS time (
        start_time TIMESTAMP CONSTRAINT time_pk PRIMARY KEY,
        hour INT NOT NULL CHECK (hour >= 0),
        day INT NOT NULL CHECK (day >= 0),
        week INT NOT NULL CHECK (week >= 0),
        month INT NOT NULL CHECK (month >= 0),
        year INT NOT NULL CHECK (year >= 0),
        weekday VARCHAR NOT NULL
    );

Column Referencing: If you already know that there are foreign key references across different tables, you can specify them when creating your table....

March 1, 2019 · 2 min · 299 words · Eric

Apache Spark Job Optimisation

Spark Job Optimisation:

    spark-submit --py-files ./rs_commons_util.zip --executor-cores 4 --num-executors 4 ./main.py

Reference: How We Optimise Apache Spark Jobs; Apache Spark: Config Cheatsheet; What I Learned From Processing Big Data With Apache Spark; Cloudera: How-to: Tune Your Apache Spark Jobs (Part 1); Cloudera: How-to: Tune Your Apache Spark Jobs (Part 2); Hortonworks: Spark num-executors setting; Best Practices Writing Production-Grade PySpark Jobs; GitHub: ekampf/PySpark-Boilerplate; GitHub: snowplow/spark-example-project

October 28, 2018 · 1 min · 65 words · Eric

Basic Usage of Pandas  [draft]

DataFrame: Create a DataFrame. Get DataFrame column headers: list(df). Reference: RealPython.com: Python Pandas: Tricks & Features You May Not Know; TowardsDataScience.com: 23 great Pandas codes for Data Scientists; AnalyticsVidhya.com: 12 Useful Pandas Techniques in Python for Data Manipulation
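As a rough illustration of the two items in this draft (creating a DataFrame and reading its column headers with list(df)), the sketch below uses made-up sample data. The column names and the names `headers` and `headers_alt` are assumptions chosen for the example.

```python
import pandas as pd

# Create a small DataFrame from a dict of columns (illustrative data only).
df = pd.DataFrame({
    "brand": ["Yum Yum", "Indomie"],
    "style": ["cup", "pack"],
    "rating": [4.0, 5.0],
})

# Iterating over a DataFrame yields its column labels,
# so list(df) gives the headers in column order.
headers = list(df)

# An equivalent, more explicit spelling.
headers_alt = df.columns.tolist()

print(headers)
```

Both forms return a plain Python list; `df.columns.tolist()` may be easier to read for someone who does not know that iterating over a DataFrame walks its columns rather than its rows.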

October 6, 2018 · 1 min · 38 words · Eric