Spark Standalone in AWS

Daehwan Bae
2 min read · Jan 5, 2020


Spark is also a good tool for processing small jobs.

Prepare a Docker image

Start from an Alpine image (you need to install a few extra packages) or an Ubuntu image.

FROM openjdk:8-alpine
MAINTAINER xxx@mail
RUN apk add --update python2 python3 bash coreutils procps gcompat

# Install your python library
RUN python3 -m pip install bs4

WORKDIR /app/service
RUN wget http://apache.mirror.cdnetworks.com/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz && \
    tar xfz spark-2.4.0-bin-hadoop2.7.tgz && \
    ln -s spark-2.4.0-bin-hadoop2.7 spark
RUN rm -f *.tar.gz *.tgz

ENV SPARK_HOME=/app/service/spark
ENV SPARK_LOG_DIR=/app/logs/spark
ENV SPARK_LOCAL_DIRS=/app/tmp
ENV PATH=$SPARK_HOME/sbin:$PATH
ENV SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dcom.amazonaws.services.s3.enforceV4=true"

Build and push the Docker image

docker build -t your_docker_hub_address/spark:2.4.0 .
docker push your_docker_hub_address/spark:2.4.0
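
Before relying on the pushed image, a quick local smoke test is worthwhile; a minimal sketch (spark-submit is called by its full path because the Dockerfile only puts $SPARK_HOME/sbin on the PATH):

# Print the Spark version from inside the image as a smoke test.
docker run --rm your_docker_hub_address/spark:2.4.0 \
  /app/service/spark/bin/spark-submit --version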

Start the standalone master on one node

docker run -d --net=host \
  -v /app/logs:/app/logs -v /app/tmp:/app/tmp \
  your_docker_hub_address/spark:2.4.0 \
  /bin/bash -c "SPARK_NO_DAEMONIZE= start-master.sh"
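
Once the container is running, the master should be listening on the default ports: 7077 for the cluster protocol and 8080 for the web UI. A quick sanity check, assuming those defaults and that the ports are open in the security group:

# Check the master web UI (default port 8080) and tail the container log.
curl -sf http://master_ip_address:8080 > /dev/null && echo "master web UI is up"
docker logs --tail 20 $(docker ps -lq)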

Start standalone slave nodes using an ASG

In my case, I put this into an AWS Auto Scaling Group (in the user-data section), so I can increase the ASG size at any time.

CPU=$(cat /proc/cpuinfo | grep processor | wc -l)
docker run -d --net=host \
  -v /app/logs:/app/logs -v /app/tmp:/app/tmp \
  your_docker_hub_address/spark:2.4.0 \
  /bin/bash -c "SPARK_NO_DAEMONIZE= start-slave.sh spark://master_ip_address:7077 -c $(($CPU - 1))"
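
Since each worker registers itself with the master on boot, scaling out later only requires raising the group's desired capacity, for example with the AWS CLI (the group name below is a placeholder):

# Add workers by growing the ASG; new instances run the user-data above
# and join the master automatically. "spark-workers" is a hypothetical name.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name spark-workers \
  --desired-capacity 10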

Set up the standalone driver node

In my case, I set up the Spark driver on a Jenkins slave node.

Download the two files below into a folder (e.g. /opt/spark/jars).

To access S3, the library versions are important: hadoop-aws 2.7.3 must be paired with aws-java-sdk 1.7.4.

wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

Edit conf/spark-defaults.conf

  • com.amazonaws.services.s3.enforceV4: the bundled AWS SDK is old and does not use Signature Version 4 by default. Users in regions that only accept V4 signing, such as ap-northeast-2 (Seoul), have to add this property.
  • On HDFS, output files are first written to a temporary directory and then renamed into place. S3 is object-based and prefixes cannot be renamed, so the mapreduce.fileoutputcommitter settings below are adjusted to reduce that costly copy-and-rename step.
spark.master                     spark://MASTER_URL:7077
spark.driver.cores               1
spark.driver.memory              2g
spark.driver.extraJavaOptions    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dcom.amazonaws.services.s3.enforceV4=true
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dcom.amazonaws.services.s3.enforceV4=true
spark.executor.memory            50g
spark.jars                       /opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar
spark.jars.packages              org.mongodb.spark:mongo-spark-connector_2.11:2.2.3,org.mongodb:mongo-java-driver:3.8.0,org.elasticsearch:elasticsearch-spark-20_2.11:6.2.4
spark.ui.showConsoleProgress     true
spark.speculation                false
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version    2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped      true
spark.hadoop.parquet.enable.summary-metadata                    false
spark.hadoop.fs.s3a.endpoint     s3.ap-northeast-2.amazonaws.com
spark.hadoop.fs.s3a.impl         org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl         org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.experimental.input.fadvise  random
spark.hadoop.fs.s3a.fast.upload                 true
spark.hadoop.fs.s3a.fast.upload.active.blocks   8
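
With this file in place on the Jenkins slave node, jobs can be submitted straight to the standalone master and read or write S3 through the s3a:// scheme. A minimal sketch (the script path, bucket, and core count are placeholders):

# spark-defaults.conf supplies the master URL, S3A settings, and extra jars.
spark-submit \
  --total-executor-cores 8 \
  /opt/jobs/process_logs.py s3a://your-bucket/input/ s3a://your-bucket/output/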
