Apache Spark Job Optimisation

Spark Job optimisation 1 spark-submit --py-files ./rs_commons_util.zip --executor-cores 4 --num-executors 4 ./main.py Reference How We Optimise Apache Spark Jobs Apache Spark: Config Cheatsheet What I Learned From Processing Big Data With Apache Spark Cloudera: How-to: Tune Your Apache Spark Jobs (Part 1) Cloudera: How-to: Tune Your Apache Spark Jobs (Part 2) Hortonworks: Spark num-executors setting Best Practices Writing Production-Grade PySpark Jobs Github: ekampf/PySpark-Boilerplate Github: snowplow/spark-example-project

October 28, 2018 · 1 min · 65 words · Eric

System Design Note II - Performance Metrics

Performance Metrics Scalability Ability of a system to grow and manage increased traffic. Increased volume of data or requests. Our goal is we want to achieve this growth without a lose in performance. Bad system design could result in a bottleneck on the number of users or traffic our application can handle, or could result in exponentially increasing cost to server a small increased traffic. Reliability Probability a system will fail during a period of time....

March 4, 2017 · 3 min · 550 words · Eric

Hive Query Performance Tuning

There are several parameters that we can tune in Hive to improve the overall query performance. For instance, 1 2 3 4 5 6 7 8 9 10 11 12 -- refers to http://hortonworks.com/community/forums/topic/mapjoinmemoryexhaustionexception-on-local-job/ -- before running your query to disable local in-memory joins and force the join to be done as a distributed Map-Reduce phase. -- After running your query you should set the value back to true with: set hive....

May 11, 2015 · 1 min · 108 words · Eric

SQL Query Performance Optimisation

Try not to use SELECT * and provide specific fields Bad Example 1 SELECT * FROM user; Good Example 1 SELECT id, username, tel FROM user; Reason Specify fields will save computational resources, and reduce network expense. It’s possible using pre-built table index to reduce table returns, and improve query efficiency. Avoid using OR in the where clause to connect conditions Bad Example 1 2 SELECT id, username, tel FROM user WHERE id=1 OR salary=500; Good Example Use UNION ALL to concatenate results 1 2 3 4 5 SELECT id, username, tel FROM user WHERE id=1 UNION ALL SELECT id, username, tel FROM user WHERE salary=500; Use two separate SQL 1 2 SELECT id, username, tel FROM user WHERE id=1 1 2 SELECT id, username, tel FROM user WHERE salary=500; Reason The use of OR may invalidate the table index, and thus will do full table scan....

August 11, 2013 · 4 min · 673 words · Eric