Spark Standalone in AWS
Jan 5, 2020
Spark is also a good tool for processing small-size jobs.
Prepare the Docker image
Start from an Alpine image (you need to install a few more packages) or an Ubuntu image.
FROM openjdk:8-alpine
MAINTAINER xxx@mail
RUN apk add --update python2 python3 bash coreutils procps gcompat
# Install your python libraries
RUN python3 -m pip install bs4
WORKDIR /app/service
RUN wget http://apache.mirror.cdnetworks.com/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz \
 && tar xfz spark-2.4.0-bin-hadoop2.7.tgz \
 && ln -s spark-2.4.0-bin-hadoop2.7 spark
RUN rm -f *.tar.gz *.tgz
ENV SPARK_HOME=/app/service/spark
ENV SPARK_LOG_DIR=/app/logs/spark
ENV SPARK_LOCAL_DIRS=/app/tmp
ENV PATH=$SPARK_HOME/sbin:$PATH
ENV SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dcom.amazonaws.services.s3.enforceV4=true"
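One caveat: the cdnetworks mirror only keeps recent Spark releases, so the 2.4.0 tarball may disappear over time. The Apache archive keeps old versions and can be swapped in (same tarball, different host):

RUN wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz \
 && tar xfz spark-2.4.0-bin-hadoop2.7.tgz \
 && ln -s spark-2.4.0-bin-hadoop2.7 spark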
Build and push the Docker image
docker build -t your_docker_hub_address/spark:2.4.0 .
docker push your_docker_hub_address/spark:2.4.0
Start the standalone master on one node
docker run -d --net=host -v /app/logs:/app/logs -v /app/tmp:/app/tmp your_docker_hub_address/spark:2.4.0 /bin/bash -c "SPARK_NO_DAEMONIZE= start-master.sh"
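To confirm the master is running: the standalone web UI listens on port 8080 by default (the cluster endpoint itself is 7077). A quick check from any node, with master_ip_address being the same placeholder used for the slaves below:

# Web UI answers on 8080 when the master is healthy
curl -sf http://master_ip_address:8080 > /dev/null && echo "Spark master UI is up"
# Or look at the container's own log
docker logs $(docker ps -q --filter ancestor=your_docker_hub_address/spark:2.4.0) | tail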
Start standalone slave nodes using an ASG
In my case, I put the command below into the user-data section of an AWS Auto Scaling Group, so I can increase the ASG size at any time.
CPU=$(cat /proc/cpuinfo | grep processor | wc -l)
docker run -d --net=host -v /app/logs:/app/logs -v /app/tmp:/app/tmp your_docker_hub_address/spark:2.4.0 \
  /bin/bash -c "SPARK_NO_DAEMONIZE= start-slave.sh spark://master_ip_address:7077 -c $(($CPU - 1))"
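For completeness, a minimal user-data sketch, assuming the AMI already ships Docker and the /app/logs and /app/tmp directories exist; everything else mirrors the command above, and one core is left free for the OS and agents:

#!/bin/bash
# Count the instance's cores and start one Spark standalone worker per instance.
CPU=$(grep -c ^processor /proc/cpuinfo)
docker pull your_docker_hub_address/spark:2.4.0
docker run -d --net=host \
  -v /app/logs:/app/logs -v /app/tmp:/app/tmp \
  your_docker_hub_address/spark:2.4.0 \
  /bin/bash -c "SPARK_NO_DAEMONIZE= start-slave.sh spark://master_ip_address:7077 -c $(($CPU - 1))"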
Set up the standalone driver node
In my case, I set up the Spark driver on a Jenkins slave node.
Download the two jars below into a folder (e.g. /opt/spark/jars).
To access S3, the versions matter: hadoop-aws has to match the Hadoop build bundled with Spark (2.7.x here), and aws-java-sdk 1.7.4 is the version that hadoop-aws 2.7.3 was built against.
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
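central.maven.org has since been retired; if those URLs fail, the same artifacts are served from Maven Central's canonical host, repo1.maven.org:

wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar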
Edit conf/spark-defaults.conf
- com.amazonaws.services.s3.enforceV4: the bundled AWS SDK is old and signs S3 requests with Signature Version 2 by default; regions that only accept Signature Version 4, such as ap-northeast-2 (Seoul), need this property.
- The fileoutputcommitter settings: on HDFS, output is first written to a temporary directory and then renamed into place, but S3 is object-based and prefixes cannot be renamed, so the default rename-heavy commit is slow there.
spark.master                       spark://MASTER_URL:7077
spark.driver.cores                 1
spark.driver.memory                2g
spark.driver.extraJavaOptions      -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dcom.amazonaws.services.s3.enforceV4=true
spark.executor.extraJavaOptions    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dcom.amazonaws.services.s3.enforceV4=true
spark.executor.memory              50g
spark.jars                         /opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar
spark.jars.packages                org.mongodb.spark:mongo-spark-connector_2.11:2.2.3,org.mongodb:mongo-java-driver:3.8.0,org.elasticsearch:elasticsearch-spark-20_2.11:6.2.4
spark.ui.showConsoleProgress       true
spark.speculation                  false
spark.serializer                   org.apache.spark.serializer.KryoSerializer
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped   true
spark.hadoop.parquet.enable.summary-metadata false
spark.hadoop.fs.s3a.endpoint       s3.ap-northeast-2.amazonaws.com
spark.hadoop.fs.s3a.impl           org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl           org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl            org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.experimental.input.fadvise random
spark.hadoop.fs.s3a.fast.upload    true
spark.hadoop.fs.s3a.fast.upload.active.blocks 8
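With the config in place, jobs can be submitted from the Jenkins node. A minimal sketch, assuming Spark is unpacked under /opt/spark on the driver and AWS credentials come from the EC2 instance profile (your_job.py and the bucket names are placeholders):

# spark-defaults.conf already supplies the master URL, the S3A settings
# and the extra jars/packages, so the submit command itself stays short.
/opt/spark/bin/spark-submit \
  --deploy-mode client \
  your_job.py s3a://your-bucket/input/ s3a://your-bucket/output/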