A Scala + Spark implementation of the DBSCAN clustering algorithm
First, clone the repository locally
git clone https://github.com/AlecioP/DBSCAN-distributed
Then move into the local repository
cd DBSCAN-distributed
In order to build a jar file to execute remotely on an EMR cluster we use sbt (a build tool for Java and Scala, similar to Maven)
sbt requires a JDK, so install one first:
MACOS
brew install openjdk
If you do not have Homebrew installed, install it first with
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
UBUNTU
sudo apt-get install openjdk-11-jdk
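You can verify the JDK installation with
java -version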
Then you can install sbt:
MACOS
brew install sbt
UBUNTU
sudo apt-get install sbt
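To check that sbt was installed correctly
sbt sbtVersion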
Read the build.sbt file for the full picture of the dependencies
Mainly, we add the Spark dependency at the specific version required by the EMR cluster, marking it as "provided" because the EMR cluster already has the library installed
We then compile a thin jar from our repository, meaning the archive contains only our app's classes and no source or class files from the dependencies
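For reference, a thin-jar build file for this setup might look like the sketch below; the project name, Scala version, and module list are illustrative assumptions, so consult the repository's actual build.sbt
name := "dbscan-distributed" // illustrative name, check the real build.sbt
scalaVersion := "2.11.12" // assumption: Spark 2.4.7 on emr-5.32.0 targets Scala 2.11
// "provided" keeps Spark classes out of the packaged jar, since EMR ships them at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.7" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.7" % "provided"
)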
To compile the sources
sbt compile
To create the JAR file
sbt package
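sbt package writes the archive under target/scala-2.11/ (the exact folder depends on the scalaVersion set in build.sbt). Keep note of its path, since this is the file we will upload to S3 later
ls target/scala-2.11/*.jar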
Then we only need to run our application on an EMR cluster
To do that, first go to your AWS console and open the EC2 service page
Once there, from the menu on the left go to
Network & Security>Key Pairs
Here you can create a key pair for connecting remotely to an EC2 machine via ssh
Follow the wizard, create the key pair, and save your copy to your local machine
NOTE : Keep track of where the *.pem file has been saved on your machine and of course don't lose it.
Alternatively, install the AWS CLI:
MACOS
brew install awscli
UBUNTU
sudo apt-get install awscli
Once done:
export KEYNAME=SomeNameForKey
# Name the file with a *.pem extension
export KEYFILE=/full/path/to/new/file/containing/key.pem
aws ec2 create-key-pair --key-name $KEYNAME --query 'KeyMaterial' --output text > $KEYFILE
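If you want to confirm that the key pair was registered, you can list it
aws ec2 describe-key-pairs --key-name $KEYNAME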
Now we can create our cluster
Go to the EMR service page.
Here, from the menu on the left, click Clusters, then create a cluster.
Select the number of nodes that compose your cluster and the instance type according to your needs, choose the Spark version to execute (we use Spark 2.4.7 from emr-5.32.0, but you can change it through the build file), and, most importantly, select the key pair you just created in the security section.
With the AWS CLI:
# CUSTOMIZE ALL THESE PARAMETERS
export KEYNAME=TheKeyNameFromThePreviousSection
export SUBNET_ID=TheIdOfSubnet
export EXECUTOR_SECURITY_GROUP=SecurityGroupForExecutor
export DRIVER_SECURITY_GROUP=SecurityGroupForDriver
export LOG_BUCKET=s3://S3BucketForLog
export EMR_VERSION=emr-5.32.0
export CLUSTER_NAME=SomeName
export EXECUTOR_INSTANCE_TYPE=t4g.nano
export EXECUTOR_INSTANCE_NUM=2
export DRIVER_INSTANCE_TYPE=t4g.nano
export REGION=us-east-1
aws emr create-cluster --applications Name=Spark Name=Zeppelin \
--ec2-attributes "{\"KeyName\":\"${KEYNAME}\",\"InstanceProfile\":\"EMR_EC2_DefaultRole\",\"SubnetId\":\"${SUBNET_ID}\",\"EmrManagedSlaveSecurityGroup\":\"${EXECUTOR_SECURITY_GROUP}\",\"EmrManagedMasterSecurityGroup\":\"${DRIVER_SECURITY_GROUP}\"}" \
--service-role EMR_DefaultRole \
--enable-debugging \
--release-label $EMR_VERSION \
--log-uri "${LOG_BUCKET}" \
--name "${CLUSTER_NAME}" \
--instance-groups "[{\"InstanceCount\":${EXECUTOR_INSTANCE_NUM},\"EbsConfiguration\":{\"EbsBlockDeviceConfigs\":[{\"VolumeSpecification\":{\"SizeInGB\":32,\"VolumeType\":\"gp2\"},\"VolumesPerInstance\":1}]},\"InstanceGroupType\":\"CORE\",\"InstanceType\":\"${EXECUTOR_INSTANCE_TYPE}\",\"Name\":\"Core Instance Group\"},{\"InstanceCount\":1,\"EbsConfiguration\":{\"EbsBlockDeviceConfigs\":[{\"VolumeSpecification\":{\"SizeInGB\":32,\"VolumeType\":\"gp2\"},\"VolumesPerInstance\":1}]},\"InstanceGroupType\":\"MASTER\",\"InstanceType\":\"${DRIVER_INSTANCE_TYPE}\",\"Name\":\"Master Instance Group\"}]" \
--configurations '[{"Classification":"spark","Properties":{}}]' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region $REGION
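create-cluster prints the id of the new cluster. Provisioning takes a few minutes; you can poll the cluster state until it reaches WAITING, for example
export CLUSTER_ID=TheClusterIdPrintedAbove
aws emr describe-cluster --cluster-id $CLUSTER_ID --query 'Cluster.Status.State' --output text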
Now upload to AWS the JAR we created previously. To do that, go to the S3 service page.
Create a new bucket and upload the JAR. Do the same with the dataset file; you can load it into the same S3 bucket.
Using the AWS CLI:
export S3_BUCKET_NAME=SomeName
export APP_JAR=/Path/to/JAR.jar
export DATASET_FILE=/Path/to/Dataset
aws s3 mb s3://$S3_BUCKET_NAME
aws s3 cp $APP_JAR s3://$S3_BUCKET_NAME
aws s3 cp $DATASET_FILE s3://$S3_BUCKET_NAME
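To verify that both files landed in the bucket
aws s3 ls s3://$S3_BUCKET_NAME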
Now everything is ready to run our application. Go back to your local machine and open a shell. In the shell, set a variable with the path to the aforementioned *.pem file
export AWS_AUTH_KEY="path/to/key.pem"
Make sure that only your user has read/write permissions on this file, so run the command
[sudo] chmod 600 $AWS_AUTH_KEY
Then we need the URL of the cluster. To obtain it, go to the EMR service page, select your cluster from the list, open the cluster details, and copy the URL shown in the Master public DNS field. Now, from your local shell, type
export AWS_MASTER_URL="Paste_here_the_url_you_just_copied"
At this point we can access the cluster driver (the master node) via ssh; EMR accepts logins as the hadoop user
ssh -i $AWS_AUTH_KEY hadoop@$AWS_MASTER_URL
Finally, from the remote EC2 instance's shell, we can run our application via
spark-submit \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryo.registrationRequired=true" \
--conf "spark.kryo.classesToRegister=InNode,LeafNode" \
--conf "spark.dynamicAllocation.enabled=false" \
--conf "spark.default.parallelism=8" \
--num-executors 8 \
--executor-cores 1 \
--class EntryPoint \
s3://URL_OF_JAR_WITHIN_S3 \
--data-file s3://URL_OF_DATASET \
--eps 200 \
--minc 200
Write literally EntryPoint: it is the name of the object inside the JAR that holds the application's main method.
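For orientation, the sketch below shows roughly what such an object could look like; it is a hypothetical outline, not the repository's actual code (InNode and LeafNode, registered with Kryo above, are presumably the internal tree node types).
// Hypothetical sketch only; the real implementation lives in the repository sources
import org.apache.spark.sql.SparkSession

object EntryPoint {
  def main(args: Array[String]): Unit = {
    // Minimal parsing of the --data-file, --eps and --minc flags shown above
    val opts = args.sliding(2, 2).collect { case Array(k, v) => (k, v) }.toMap
    val dataFile = opts("--data-file")
    val eps = opts("--eps").toDouble
    val minc = opts("--minc").toInt

    val spark = SparkSession.builder().appName("DBSCAN-distributed").getOrCreate()
    // ... load the dataset from dataFile and run distributed DBSCAN
    //     with radius eps and density threshold minc ...
    spark.stop()
  }
}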