- instances for each node type are up and accessible (ssh)
- a supported OS is installed on each instance
- a DNS server is available on the network and hostnames are resolvable
- instances are on the same private network
- all commands will be run as root user unless specified otherwise
If you are missing anything, check the Cloud Setup documentation first.
- Path A: One-stop binary installer
- Useful for short-term, throwaway projects
- Relies on an embedded PostgreSQL server with a fixed configuration
- Path B: Install CM and database manually
- Suitable for any cluster expected to stand for more than 3-6 months
- Can use Oracle, MySQL, MariaDB, or PostgreSQL server
- Can deploy CDH as Linux packages or parcels
- Path C: Tarballs
- DIY-oriented
- Useful with other deployment tools (Chef, Puppet)
This workshop follows path B.
- Update the nodes and add some missing tools [all nodes]
yum clean all
yum -y update
yum install -y wget vim mlocate net-tools nc lsof bind-utils nscd ntp
To get the most out of your cluster, having enough disks is very important. In production, dedicate the disks with heavy I/O requirements to their main usage rather than sharing them between services.
HDFS data disks
- mount with the noatime option (noatime disables writing file-access timestamps, saving disk I/O)
- reduce the number of reserved blocks
- create /mnt/data on the master nodes as well
- a good rule of thumb is one data disk per CPU core
workshop example (each worker node has one data volume; run the block matching that node's volume on the node itself):
# first worker node (volume data0)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_data0
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_data0
mkdir -p /mnt/data
mount -o discard,defaults,noatime /dev/disk/by-id/scsi-0DO_Volume_data0 /mnt/data
echo '/dev/disk/by-id/scsi-0DO_Volume_data0 /mnt/data ext4 defaults,nofail,discard,noatime 0 0' >> /etc/fstab
# second worker node (volume data1)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_data1
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_data1
mkdir -p /mnt/data
mount -o discard,defaults,noatime /dev/disk/by-id/scsi-0DO_Volume_data1 /mnt/data
echo '/dev/disk/by-id/scsi-0DO_Volume_data1 /mnt/data ext4 defaults,nofail,discard,noatime 0 0' >> /etc/fstab
# third worker node (volume data2)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_data2
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_data2
mkdir -p /mnt/data
mount -o discard,defaults,noatime /dev/disk/by-id/scsi-0DO_Volume_data2 /mnt/data
echo '/dev/disk/by-id/scsi-0DO_Volume_data2 /mnt/data ext4 defaults,nofail,discard,noatime 0 0' >> /etc/fstab
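A quick sanity check on each worker before moving on (the mount should show the noatime option):
df -h /mnt/data
mount | grep /mnt/data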
ZooKeeper disks
- choose disks with higher I/O throughput
workshop example (one ZooKeeper volume per master node; run the block matching that node's volume on the node itself):
# first master node (volume zoo0)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_zoo0
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_zoo0
mkdir -p /mnt/zoo /mnt/data   # /mnt/data is created on the masters as well (see above)
mount -o discard,defaults /dev/disk/by-id/scsi-0DO_Volume_zoo0 /mnt/zoo
echo '/dev/disk/by-id/scsi-0DO_Volume_zoo0 /mnt/zoo ext4 defaults,nofail,discard 0 0' >> /etc/fstab
# second master node (volume zoo1)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_zoo1
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_zoo1
mkdir -p /mnt/zoo /mnt/data
mount -o discard,defaults /dev/disk/by-id/scsi-0DO_Volume_zoo1 /mnt/zoo
echo '/dev/disk/by-id/scsi-0DO_Volume_zoo1 /mnt/zoo ext4 defaults,nofail,discard 0 0' >> /etc/fstab
# third master node (volume zoo2)
mkfs.ext4 -m 0 -F /dev/disk/by-id/scsi-0DO_Volume_zoo2
tune2fs -m 0 /dev/disk/by-id/scsi-0DO_Volume_zoo2
mkdir -p /mnt/zoo /mnt/data
mount -o discard,defaults /dev/disk/by-id/scsi-0DO_Volume_zoo2 /mnt/zoo
echo '/dev/disk/by-id/scsi-0DO_Volume_zoo2 /mnt/zoo ext4 defaults,nofail,discard 0 0' >> /etc/fstab
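Likewise, verify the ZooKeeper mount on each master:
df -h /mnt/zoo
mount | grep /mnt/zoo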
The examples above are specific to DigitalOcean instances. The steps do not cover tasks such as creating the block storage volumes and attaching them to the instances.
Cloudera Manager requires SSH access to all cluster hosts. Although it offers a password option, we strongly recommend key-based access. This means you need to make sure that access with one public key is enabled on all servers in the cluster. We've simplified this part with a script, which we run from the desktop.
bash scripts/dist_ssh_keys.sh
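The script itself is not reproduced here; a minimal sketch of what it might look like (hypothetical hostnames; assumes a local key pair and that password authentication is still enabled for the first copy):
#!/usr/bin/env bash
# hypothetical sketch of scripts/dist_ssh_keys.sh -- adjust HOSTS to your cluster
HOSTS="edge master0 master1 master2 worker0 worker1 worker2"
for h in $HOSTS; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub "root@$h"   # appends the public key to root's authorized_keys
done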
- set vm.swappiness to 1 [all nodes]
sysctl vm.swappiness=1
echo "vm.swappiness = 1" >> /etc/sysctl.conf
Traditionally, vm.swappiness was recommended to be set to 0, but the recommended value is now 1. Check this blog post.
- transparent hugepages
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled
cat >> /etc/rc.local <<EOF
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
EOF
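Note that on CentOS 7, /etc/rc.d/rc.local is not executable by default, so the snippet above will only run at boot after you mark the file executable:
chmod +x /etc/rc.d/rc.local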
Transparent huge pages (THP) do not interact well with Hadoop workloads and can degrade performance.
- forward and reverse host lookup
First and foremost, check that /etc/hosts does not map the hostname to a loopback address (127.0.0.1)!
- forward lookup
host `hostname`
getent hosts <FQDN>
nslookup <FQDN>
- reverse lookup
getent hosts <IP>
nslookup <IP>
- cloudera check
python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
The check should print the FQDN followed by the host's private IP address, not 127.0.0.1.
- correct NTP settings
Make sure you enable NTP on all of your hosts.
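The ntp package was already installed during node preparation; on CentOS 7, enabling and verifying it looks like this:
systemctl enable ntpd
systemctl start ntpd
ntpq -p   # lists the configured time sources and whether they are reachable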
- IPTables
For the workshop we will not set up any iptables rules. You can return and set them up at a later stage; Cloudera Manager has a section listing all configured ports, which is useful when writing the rules.
- SELinux
We suggest disabling SELinux on all nodes.
to check the current mode:
getenforce
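To disable it, switch to permissive mode immediately and make the change permanent for the next reboot (assuming the default /etc/selinux/config layout):
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config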
This workshop covers a quick installation and configuration of MariaDB.
- Install the yum repo [all nodes]
cat > /etc/yum.repos.d/mariadb.repo <<'EOF'
# MariaDB 10.1 CentOS repository list - created 2017-01-03 13:50 UTC
# http://downloads.mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/10.1/centos7-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1
EOF
- install mariadb-server from yum [edge node]
yum -y install mariadb-server
- stop the mariadb-server if running [edge node]
service mariadb stop
- Add the recommended configuration [edge node]
cat > /etc/my.cnf <<'EOF'
[mysqld]
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links = 0
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M
#log_bin should be on a disk with enough free space. Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your system
#and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
EOF
check https://www.cloudera.com/documentation/enterprise/latest/topics/install_cm_mariadb.html#install_cm_mariadb_config for details.
- enable and start MariaDB [edge node]
systemctl enable mariadb
systemctl start mariadb
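Before moving on, it is worth confirming that the service came up and is listening on the default port 3306:
systemctl is-active mariadb
ss -tlnp | grep 3306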
- Next we'll set the root password and apply some basic security hardening [edge node]
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it? [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
- install the JDBC driver [all nodes]
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.41.tar.gz
tar -xvzf mysql-connector-java-5.1.41.tar.gz
mkdir -p /usr/share/java/
cp mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar /usr/share/java/mysql-connector-java.jar
- install the clients [master and worker nodes]
Check that each node has the MariaDB repo; if not, repeat the repo step above.
yum install -y MariaDB-client
- enable access for Cloudera specific accounts [edge node]
mysql -u root -p
Enter password: <enter password>
create database amon DEFAULT CHARACTER SET utf8;
create database rman DEFAULT CHARACTER SET utf8;
create database metastore DEFAULT CHARACTER SET utf8;
create database sentry DEFAULT CHARACTER SET utf8;
create database nav DEFAULT CHARACTER SET utf8;
create database navms DEFAULT CHARACTER SET utf8;
create database scm DEFAULT CHARACTER SET utf8;
create database hue DEFAULT CHARACTER SET utf8;
create database oozie DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon';
grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'rman';
grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive';
grant all on sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry';
grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'nav';
grant all on navms.* TO 'navms'@'%' IDENTIFIED BY 'navms';
grant all on scm.* TO 'scm'@'%' IDENTIFIED BY 'scm';
grant all on hue.* to 'hue'@'%' identified by 'hue';
grant all on hue.* to 'hue'@'localhost' identified by 'hue';
grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';
grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';
exit;
Simple passwords are used to ease the flow of this workshop. In production, use stronger passwords.
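To confirm the grants work over the network, try connecting from a master or worker node (the MariaDB client was installed there in the previous step; replace <edge_node> with your edge node's FQDN):
mysql -h <edge_node> -u scm -pscm -e 'SHOW DATABASES;'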
- Add the Cloudera repo [all nodes]
cat > /etc/yum.repos.d/Cloudera.repo <<'EOF'
[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 7 x86_64
name=Cloudera Manager
baseurl=https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/5/
gpgkey=https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/RPM-GPG-KEY-cloudera
gpgcheck=1
EOF
Always double-check the baseurl so that you are pointing at the right repository.
- Install a supported Oracle JDK [edge node]
yum -y install oracle-j2sdk1.7
- Install Cloudera Manager Server [edge node]
yum -y install cloudera-manager-daemons cloudera-manager-server
Do not start the server yet!
- import the database schema [edge node]
The steps are a bit hard to find in the Cloudera documentation, but here is the link.
- Run the scm_prepare_database.sh script
/usr/share/cmf/schema/scm_prepare_database.sh mysql scm scm scm
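For reference, the positional arguments are the database type, database name, user, and password, so the call above points Cloudera Manager at the scm database and credentials created earlier:
/usr/share/cmf/schema/scm_prepare_database.sh <database_type> <database_name> <user> <password>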
- Start Cloudera Manager Server [edge node]
service cloudera-scm-server start
It's a good idea to watch what is happening in the log:
/var/log/cloudera-scm-server/cloudera-scm-server.log
This is the magical line you are looking for: WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.
- Run the installation wizard
a. Open a browser and log in to Cloudera Manager as admin with password admin: http://<edge_node>:7180
b. Choose a Cloudera Manager edition (we will use Enterprise Data Hub trial). Press Continue.
c. Check the installer version shown in the info page. Press Continue.
d. Specify all hostnames and press search.
e. Verify that the server FQDNs and IPs are correct and that all hosts are in the ready state. If everything is OK, press Continue.
f. Choose desired repositories and installation options. (We will leave it as it is for now).
g. JDK installation selection. For the workshop tick both boxes.
h. DO NOT select single user mode.
i. Provide the SSH private key you use for root access (download it from the server first if you have not already).
j. Run the installation and check progress. All servers must have successful completion. Press Continue once done.
k. Let the parcels install. Press Continue.
l. In the cluster setup page choose the desired services. In the workshop we'll run core Hadoop.
m. Next, assign the roles to the servers.
n. In the database setup, enter the credentials for the individual MariaDB databases created earlier.
o. The next step reviews the most important configuration values.
Make sure that for the workshop you have the following settings:
* `dfs.block.size, dfs.blocksize` = 128MiB
* `dfs.datanode.failed.volumes.tolerated` = 0
* `dfs.data.dir, dfs.datanode.data.dir` = /mnt/data/dfs/dn
* `dfs.name.dir, dfs.namenode.name.dir` = /mnt/data/dfs/nn
* `fs.checkpoint.dir, dfs.namenode.checkpoint.dir` = /mnt/data/dfs/snn
* `hive.metastore.warehouse.dir` = /user/hive/warehouse
* `hive.metastore.port` = 9083
* `yarn.nodemanager.local-dirs` = /mnt/data/yarn/nm
* `dataDir` = /mnt/zoo/zookeeper
* `dataLogDir` = /mnt/zoo/zookeeper
p. Run and complete the cluster setup.
q. Press Finish.
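Once the wizard finishes, a quick smoke test from any cluster node confirms that HDFS is up (run as the hdfs superuser):
sudo -u hdfs hdfs dfsadmin -report
sudo -u hdfs hdfs dfs -ls /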
- Restart the cluster if required.
- Set up HDFS High Availability
a. While logged into Cloudera Manager, click on the HDFS service in Cluster 1.
b. Click on the Actions button and navigate to Enable High Availability.
c. Confirm the new nameservice name.
d. Select a second NameNode and 3 JournalNodes.
e. Enter the directory locations for the NameNode data (/mnt/data/dfs/nn) and the JournalNode edits (/mnt/data/dfs/jn).
f. Leave the checkboxes at the end of the screen as they are.
g. Run the HA migration and wait until it completes.
h. Check the new role instances in the HDFS -> Instances section of Cloudera Manager.
- Install Spark 2.1
a. Download the Spark 2 CSD
wget http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
b. move the jar to the csd directory
mv SPARK2_ON_YARN-2.1.0.cloudera1.jar /opt/cloudera/csd
c. change permissions and ownership
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.1.0.cloudera1.jar
d. restart Cloudera Manager
service cloudera-scm-server restart
e. log back into Cloudera Manager and click on the "Stale Configuration" icon
f. click Restart Cloudera Management Service
g. click the parcels icon
h. Find SPARK2 in the list and click Download
i. Next click on Distribute
j. Finally click on Activate
k. Go back to the main screen
l. Next to Cluster 1 click on the dropdown and select "Add Service"
m. Select Spark 2
n. Choose the dependency set that includes Hive, HDFS, YARN, and ZooKeeper.
o. Add the Spark History Server role to one of the master nodes.
p. Select all hosts as gateway roles for Spark
q. Press Continue and wait until successfully completed
r. Refresh any "Stale Configuration"
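A quick way to verify the new service is to run the bundled SparkPi example from a gateway node; the examples jar path below assumes the default SPARK2 parcel layout:
spark2-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 10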
- Install Jupyter
a. install development tools for CentOS 7 (gcc, make, ...)
yum -y groupinstall "Development Tools"
b. install python-devel
yum -y install python-devel
c. install pip
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
python get-pip.py
d. install Jupyter
pip install jupyter
e. run with pyspark
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --allow-root"
pyspark2
Jupyter prints a URL with a login token to the console; open http://<node_ip>:8888 in a browser and use that token to reach the notebook.
Further reading:
- Blog: How to deploy Apache Hadoop clusters like a boss
- Cloudera Security Overview
- Enabling Kerberos Authentication Using the Wizard