A Scala + Spark implementation of the DBSCAN clustering algorithm

Download and environment setting

First, clone the repository locally:

git clone

Then move into the local repository:

cd DBSCAN-distributed

To build a JAR file that can be executed remotely on an EMR cluster, we use sbt, a build tool for Java and Scala similar to Maven.

sbt requires a JDK, so install one first.

On macOS (using Homebrew):

brew install openjdk

If you do not have Homebrew installed, install it first:

/bin/bash -c "$(curl -fsSL"

On Linux (Debian/Ubuntu):

sudo apt-get install openjdk-11-jdk

Then install sbt.

On macOS:

brew install sbt

On Linux (Debian/Ubuntu):

sudo apt-get install sbt
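
Before building, it can help to confirm that both tools are on your PATH. The following is a minimal sketch; the helper name `check_tool` is only illustrative and is not part of the repository.

```shell
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper name, not part of the repository.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_tool java
check_tool sbt
```

If either line reports "missing", repeat the corresponding installation step above.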

To compile the JAR locally

Read the build.sbt file to see how the dependencies are declared.

The main dependency is Spark, pinned to the version required by the EMR cluster. It is marked as "provided" because the EMR cluster already has the library installed.

We then build a thin JAR, which means the archive contains only our application's compiled classes and none of the classes from its dependencies.

To compile the sources

sbt compile

To create the JAR file

sbt package
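
To check that the result really is a thin JAR, you can list its contents. This is a rough sketch, assuming sbt's default output layout (target/scala-<version>/) and that the `jar` tool from the JDK is on your PATH; `find_jar` is an illustrative helper name.

```shell
# Print the first JAR found under <project>/target/scala-*/ (nothing if none).
# find_jar is a hypothetical helper; the path assumes sbt's default layout.
find_jar() {
  for f in "${1:-.}"/target/scala-*/*.jar; do
    [ -e "$f" ] && { echo "$f"; return 0; }
  done
  return 0
}

APP_JAR=$(find_jar .)
if [ -n "$APP_JAR" ]; then
  echo "Built JAR: $APP_JAR"
  # Count entries under org/apache/spark/; a thin JAR should report 0.
  jar tf "$APP_JAR" | grep -c '^org/apache/spark/' || true
fi
```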

To set up AWS execution

All that remains is to run our application on an EMR cluster.

To do that, first go to your AWS console and open the EC2 service page.

Once there, from the menu on the left go to Network & Security > Key Pairs. Here you can create a key pair for connecting remotely to an EC2 machine via ssh. Follow the wizard, create the key pair, and save your copy to your local machine.

NOTE: Remember where the *.pem file has been saved on your machine, and of course don't lose it.

Alternatively, install the AWS CLI.

On macOS:

brew install awscli

On Linux (Debian/Ubuntu):

sudo apt-get install awscli

Once installed, create the key pair:

export KEYNAME=SomeNameForKey

# Name the file with a .pem extension
export KEYFILE=/full/path/to/new/file/containing/key.pem

aws ec2 create-key-pair --key-name $KEYNAME --query 'KeyMaterial' --output text > $KEYFILE

Now we can create our cluster. Go to the EMR service page. From the menu on the left, click Clusters, then create a cluster. Select the number of nodes in your cluster and the node type according to your needs, and choose the Spark version to run (we use Spark-2.4.7-aws from emr-5.32.0, but you can change it through the build file). Most importantly, choose the key pair you just created in the security section.

With the AWS CLI:


export KEYNAME=TheKeyNameFromThePreviousSection
export SUBNET_ID=TheIdOfSubnet
export EXECUTOR_SECURITY_GROUP=SecurityGroupForExecutor
export DRIVER_SECURITY_GROUP=SecurityGroupForDriver
export LOG_BUCKET=s3://S3BucketForLog
export EMR_VERSION=emr-5.32.0

export CLUSTER_NAME=SomeName

# Placeholders: the command below uses these two variables, so set them to your own values
export EXECUTOR_INSTANCE_TYPE=InstanceTypeForExecutors
export EXECUTOR_INSTANCE_NUM=NumberOfExecutorInstances
export DRIVER_INSTANCE_TYPE=t4g.nano

export REGION=us-east-1

# Variables are placed outside the single-quoted JSON ('..."'"${VAR}"'"...') so that the shell expands them
aws emr create-cluster --applications Name=Spark Name=Zeppelin \
--ec2-attributes '{"KeyName":"'"${KEYNAME}"'","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"'"${SUBNET_ID}"'","EmrManagedSlaveSecurityGroup":"'"${EXECUTOR_SECURITY_GROUP}"'","EmrManagedMasterSecurityGroup":"'"${DRIVER_SECURITY_GROUP}"'"}' \
--service-role EMR_DefaultRole \
--enable-debugging \
--release-label "${EMR_VERSION}" \
--log-uri "${LOG_BUCKET}" \
--name "${CLUSTER_NAME}" \
--instance-groups '[{"InstanceCount":'"${EXECUTOR_INSTANCE_NUM}"',"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"'"${EXECUTOR_INSTANCE_TYPE}"'","Name":"Core Instance Group"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"'"${DRIVER_INSTANCE_TYPE}"'","Name":"Master Instance Group"}]' \
--configurations '[{"Classification":"spark","Properties":{}}]' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region "${REGION}"
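
One detail worth checking in commands like the one above: the shell does not expand variables inside single quotes, so a JSON argument written entirely in single quotes reaches the AWS CLI with the literal text `${...}` in it. The sketch below shows the difference using a demo value (`DEMO_KEY` is an illustrative name); it only prints strings and does not call AWS.

```shell
DEMO_KEY=demo-key   # demo value for illustration only

# Entirely inside single quotes: the variable is NOT expanded.
literal='{"KeyName":"${DEMO_KEY}"}'
echo "$literal"     # prints {"KeyName":"${DEMO_KEY}"}

# Close the single quotes around the variable so the shell expands it.
expanded='{"KeyName":"'"${DEMO_KEY}"'"}'
echo "$expanded"    # prints {"KeyName":"demo-key"}
```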

Now upload the JAR we created previously to AWS. To do that, go to the S3 service page, create a new bucket, and upload the JAR. Do the same with the dataset file; you can upload it to the same S3 bucket.

Using the AWS CLI:

export S3_BUCKET_NAME=SomeName
export APP_JAR=/Path/to/JAR.jar
export DATASET_FILE=/Path/to/Dataset

aws s3 mb s3://$S3_BUCKET_NAME

aws s3 cp $APP_JAR s3://$S3_BUCKET_NAME

aws s3 cp $DATASET_FILE s3://$S3_BUCKET_NAME
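
When a file is copied to the root of a bucket like this, the object key is the file's base name, so the S3 URL needed later by spark-submit can be derived from the local path. A small sketch, where `s3_url` is an illustrative helper name and the fallback values are the placeholders used above:

```shell
# Build the S3 URL of a file copied to the root of a bucket.
# s3_url is a hypothetical helper, not part of the repository or the AWS CLI.
s3_url() {
  printf 's3://%s/%s\n' "$1" "$(basename "$2")"
}

s3_url "${S3_BUCKET_NAME:-SomeName}" "${DATASET_FILE:-/Path/to/Dataset}"
```

You can confirm that the objects are in place with `aws s3 ls s3://$S3_BUCKET_NAME`.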


Now everything is ready to run our application. On your local machine, open a shell and set a variable with the path to the aforementioned *.pem file:

export AWS_AUTH_KEY="path/to/key.pem"

Make sure only your user has read/write permissions on this file by running:

[sudo] chmod 600 $AWS_AUTH_KEY

Then we need the URL of the cluster. To obtain it, go to the EMR service page, select the cluster from the list, open the cluster details, and copy the URL shown in the Master public DNS field. Now, from your local shell, type:

export AWS_MASTER_URL="Paste_here_the_url_you_just_copied"
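
The same value can also be read with the AWS CLI instead of the console. The sketch below assumes you know the cluster ID (the `j-...` identifier printed by `aws emr create-cluster`) and have stored it in a variable named `CLUSTER_ID`; that variable name and the `master_dns` helper are illustrative, and the CLI's default region is assumed to match the cluster's.

```shell
# Print the master node's public DNS name for a given cluster ID.
# master_dns is a hypothetical helper wrapping `aws emr describe-cluster`.
master_dns() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Only runs when CLUSTER_ID is set, e.g. CLUSTER_ID=j-XXXXXXXXXXXXX
if [ -n "${CLUSTER_ID:-}" ]; then
  # Block until the cluster is running, then export its address.
  aws emr wait cluster-running --cluster-id "$CLUSTER_ID"
  AWS_MASTER_URL=$(master_dns "$CLUSTER_ID")
  export AWS_MASTER_URL
fi
```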

At this point we can access the cluster driver via ssh (hadoop is the default user on EMR nodes):

ssh -i $AWS_AUTH_KEY hadoop@$AWS_MASTER_URL

Finally, from the remote EC2 instance shell, we can run our application with:

spark-submit \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryo.registrationRequired=true" \
--conf "spark.kryo.classesToRegister=InNode,LeafNode" \
--conf "spark.dynamicAllocation.enabled=false" \
--conf "spark.default.parallelism=8" \
--num-executors 8 \
--executor-cores 1 \
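--class EntryPoint \
s3://URL_OF_JAR \
--data-file s3://URL_OF_DATASET \
--eps 200 \
--minc 200

Write EntryPoint literally: it is the name of an object in the JAR. s3://URL_OF_JAR is a placeholder for the location of the JAR you uploaded, just as s3://URL_OF_DATASET stands for the dataset.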
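
In the command above, spark.default.parallelism equals the number of executors times the cores per executor (8 x 1). If you resize the cluster, one possible approach, assuming you want to keep that one-partition-per-core relationship, is to derive the value instead of hard-coding it. The variable and helper names here are illustrative.

```shell
# Total executor cores = number of executors * cores per executor.
# total_cores is a hypothetical helper name.
total_cores() {
  echo $(( $1 * $2 ))
}

NUM_EXECUTORS=8
EXECUTOR_CORES=1
PARALLELISM=$(total_cores "$NUM_EXECUTORS" "$EXECUTOR_CORES")

# Print the matching spark-submit options.
echo "--num-executors $NUM_EXECUTORS --executor-cores $EXECUTOR_CORES --conf spark.default.parallelism=$PARALLELISM"
```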