Posts

Docker Build with Proxy

I recently ran into network issues when building Docker images on our AWS bastion host: the docker build command does not pick up the environment variables for the network proxy. Network Issue with Docker Build: When I used curl -i https://registry.docker.io to check the network, everything looked OK. But when I tried to build the Docker image via docker build -t nba ./ it raised the following timeout error....
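The excerpt is cut off before the fix, so the following is only a sketch of one common approach and not necessarily what the full post does. It assumes the timeout comes from network access in RUN steps during the build, in which case the proxy settings can be forwarded as Docker's predefined proxy build arguments. The helper name docker_build_command and the example proxy URL are hypothetical. If the timeout happens while the daemon pulls the base image, the daemon's own proxy configuration is the place to look instead.

```python
import os
from typing import List, Mapping, Optional

# Proxy build arguments that Docker predefines, so the Dockerfile
# does not need a matching ARG instruction for them.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def docker_build_command(
    tag: str,
    context: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build a `docker build` argv that forwards proxy settings.

    Hypothetical helper: reads the proxy variables (upper or lower case)
    from `env` (defaults to the current process environment) and passes
    each one that is set as a --build-arg.
    """
    env = os.environ if env is None else env
    cmd = ["docker", "build"]
    for name in PROXY_VARS:
        value = env.get(name) or env.get(name.lower())
        if value:
            cmd += ["--build-arg", f"{name}={value}"]
    cmd += ["-t", tag, context]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(docker_build_command("nba", "./")))
```

The returned list can be handed to subprocess.run as-is; printing it first makes it easy to confirm that the proxy values are the expected ones.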

Get AWS EMR Cluster Info with Powershell

To get information about an existing EMR cluster, we can use:

    Get-EMRCluster -ClusterId $ClusterId

The command returns an object of type Amazon.ElasticMapReduce.Model.Cluster. The Cluster object provides the following attributes that may be useful:

MasterPublicDnsName. The DNS name of the master node.
NormalizedInstanceHours. An approximation of the cost of the cluster.
ReleaseLabel. The release label of Amazon EMR.
Status. The current status details about the cluster....
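For readers who work from Python rather than PowerShell, the same attributes listed above can be read from the boto3 describe_cluster response. This is a rough sketch, not part of the original post: the function names summarize_cluster and describe are hypothetical, and describe assumes boto3 is installed and AWS credentials and a region are already configured.

```python
from typing import Any, Dict


def summarize_cluster(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the attributes discussed above out of a describe_cluster response."""
    cluster = response["Cluster"]
    return {
        "MasterPublicDnsName": cluster.get("MasterPublicDnsName"),
        "NormalizedInstanceHours": cluster.get("NormalizedInstanceHours"),
        "ReleaseLabel": cluster.get("ReleaseLabel"),
        # Status is a nested structure; State holds the current state name.
        "State": cluster.get("Status", {}).get("State"),
    }


def describe(cluster_id: str) -> Dict[str, Any]:
    """Fetch the cluster from AWS and summarize it (requires boto3 and credentials)."""
    import boto3  # imported lazily so summarize_cluster works without boto3

    emr = boto3.client("emr")
    return summarize_cluster(emr.describe_cluster(ClusterId=cluster_id))
```

Keeping the extraction separate from the AWS call means the summary logic can be checked against a saved response without touching a live cluster.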

EMR JobFlow Arguments Error (draft)

I came across this error this morning with EMR and Spark steps:

    An error occurred (ValidationException) when calling the RunJobFlow operation: 1 validation error detected: Value '[ <YOUR-SPARK-JOB> ]' at 'steps.45.member.hadoopJarStep.args' failed to satisfy constraint: Member must satisfy constraint: [Member must have length less than or equal to 10280, Member must have length greater than or equal to 0, Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*]

Or:

    botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the RunJobFlow operation: Size of step parameter length exceeded the maximum allowed....
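The first message above says that each member of a step's args list must be at most 10280 characters long. A small pre-flight check along these lines can flag offending arguments before RunJobFlow is called. It is a sketch only: the function name oversized_args is hypothetical, it checks only the per-argument length quoted in the message, and it does not cover the total step size limit from the second error, whose value is not given here.

```python
from typing import List, Sequence, Tuple

# Per-argument limit quoted in the ValidationException message above.
MAX_ARG_LENGTH = 10280


def oversized_args(
    args: Sequence[str], limit: int = MAX_ARG_LENGTH
) -> List[Tuple[int, int]]:
    """Return (index, length) for every argument longer than `limit`."""
    return [(i, len(arg)) for i, arg in enumerate(args) if len(arg) > limit]


if __name__ == "__main__":
    # Hypothetical step arguments; the last one is deliberately too long.
    step_args = ["spark-submit", "./main.py", "x" * 10281]
    for index, length in oversized_args(step_args):
        print(f"args[{index}] has length {length}, limit is {MAX_ARG_LENGTH}")
```

An empty result means every argument passes the length constraint; it does not guarantee the step as a whole will be accepted.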

Apache Spark Job Optimisation

Spark Job optimisation

    spark-submit --py-files ./rs_commons_util.zip --executor-cores 4 --num-executors 4 ./main.py

Reference

How We Optimise Apache Spark Jobs
Apache Spark: Config Cheatsheet
What I Learned From Processing Big Data With Apache Spark
Cloudera: How-to: Tune Your Apache Spark Jobs (Part 1)
Cloudera: How-to: Tune Your Apache Spark Jobs (Part 2)
Hortonworks: Spark num-executors setting
Best Practices Writing Production-Grade PySpark Jobs
GitHub: ekampf/PySpark-Boilerplate
GitHub: snowplow/spark-example-project

Basic Usage of Pandas (draft)

DataFrame

Create a DataFrame

Get DataFrame Column Headers

    list(df)

Reference

RealPython.com: Python Pandas: Tricks & Features You May Not Know
TowardsDataScience.com: 23 great Pandas codes for Data Scientists
Analyticsvidhya.com: 12 Useful Pandas Techniques in Python for Data Manipulation
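The two DataFrame steps named in this draft, creating a DataFrame and getting its column headers with list(df), can be sketched as follows. The column names and values are made up for illustration.

```python
import pandas as pd

# Create a DataFrame from a dict of columns (hypothetical sample data).
df = pd.DataFrame(
    {
        "name": ["alice", "bob"],
        "score": [90, 85],
    }
)

# Iterating over a DataFrame yields its column labels,
# so list(df) returns the column headers in order.
headers = list(df)
print(headers)
```

list(df) gives the same labels as df.columns.tolist(); the shorter form works because a DataFrame iterates over its columns rather than its rows.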