
Ubuntu 11.04 Compute Cluster AMI

As of 2011-07-18, no Ubuntu-based public AMI is available for use with StarCluster on cluster compute instances. This should change soon (see issue #31). Here are my notes on setting up such an AMI, based on Ubuntu 11.04, for use with StarCluster 0.92rc2. I have not made any attempt at getting it working with the cluster GPU instances.

Choose an available AMI

Use an Ubuntu 11.04 Natty HVM (cluster compute) AMI built by Canonical. In us-east-1 this is ami-1cad5275 (as of 2011-07-18), as suggested by http://alestic.com/. Launch it from the Alestic website. It shows up as:

(starcluster)user@localhost:~$ starcluster listinstances
StarCluster - (http://web.mit.edu/starcluster) (v. 0.92rc2)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

id: i-bd2fdedc
dns_name: ec2-50-17-72-101.compute-1.amazonaws.com
private_dns_name: ip-10-17-2-245.ec2.internal
state: running
public_ip: 50.17.72.101
private_ip: 10.17.2.245
zone: us-east-1b
ami: ami-1cad5275
type: cc1.4xlarge
groups: default
keypair: starcluster_1
uptime: 00:01:31

Log in

Log in as the ubuntu user, then switch to root:

user@localhost:~$ starcluster sshinstance -u ubuntu i-bd2fdedc
ubuntu@ip-...:~$ sudo -i
root@ip-...:~#

Update the system:

root@ip-...:~# apt-get update
root@ip-...:~# apt-get upgrade

Configure root login

The AMI is configured to disable root logins, but StarCluster needs root SSH access (a scripted sketch of these edits follows the list):
  1. Edit /etc/cloud/cloud.cfg and set disable_root: 0
  2. Edit /root/.ssh/authorized_keys and remove the prefix commands from the public key entry
  3. Edit /usr/bin/cloud-init, go to line 143 and change 'once-per-instance' to 'always'.
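
These edits can also be scripted. Here is an untested sketch assuming the stock Natty cloud-init files; the sed patterns may need adjusting to the exact file contents:

root@ip-...:~# sed -i 's/disable_root: *1/disable_root: 0/' /etc/cloud/cloud.cfg
root@ip-...:~# # strip everything before the key type so the entry starts with ssh-rsa
root@ip-...:~# sed -i 's/^.*ssh-rsa /ssh-rsa /' /root/.ssh/authorized_keys
root@ip-...:~# sed -i '143s/once-per-instance/always/' /usr/bin/cloud-init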

For NFS

Install the following packages:

root@ip-...:~# apt-get install portmap nfs-common nfs-kernel-server rxvt

And add a symbolic link: StarCluster uses an init script named nfs, which is present on StarCluster's Ubuntu 10.10 based AMI but not on Ubuntu 11.04:

root@ip-...:~# ln -s /etc/init.d/nfs-kernel-server /etc/init.d/nfs
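
Since StarCluster will invoke the script by that name, check that the link resolves and (if the init script supports the status action) that the service responds:

root@ip-...:~# /etc/init.d/nfs status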

Install SGE

StarCluster 0.92rc2 uses Sun Grid Engine version 6.2u5 installed in /opt/sge6. Since Oracle bought Sun, the installation files are no longer easily available on the Internet.

Grab /opt/ from an existing 64-bit StarCluster AMI

Looking into StarCluster's Ubuntu 10.04 64-bit AMI, /opt/ contains /opt/sge6-fresh but also two versions of the drmaa-python bindings (0.2 and 0.4b3) [not used by StarCluster anymore?].

After launching an instance of that AMI with the AWS Management Console (its public DNS looks like ec2-....compute-1.amazonaws.com), grab the folder with:

user@localhost:~$ ssh -i ~/.ssh/myssh_certificate.pem root@ec2-....compute-1.amazonaws.com
root@domU...:/# tar -caf opt_starcluster.tar.gz ./opt
user@localhost:~$ scp -i ~/.ssh/myssh_certificate.pem root@ec2-....compute-1.amazonaws.com:/opt_starcluster.tar.gz ./

That instance can now be terminated.

Copy SGE on the instance we are configuring

user@localhost:~$ scp -i ~/.ssh/myssh_certificate.pem opt_starcluster.tar.gz root@ec2-....compute-1.amazonaws.com:.
root@ip-...:~# tar -xf opt_starcluster.tar.gz
root@ip-...:~# mv ./opt /
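
A quick check that the tree landed where StarCluster expects it:

root@ip-...:~# ls /opt    # should list sge6-fresh plus the two drmaa-python trees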

Create the following symbolic link (the path does not exist on Natty):

root@ip-...:~# ln -s /lib64/x86_64-linux-gnu/libc-2.13.so /lib64/libc.so.6
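
As a sanity check, libc can be executed directly and should print its version banner; this is exactly what the patched arch script below parses. The exact package version string may differ, but the output should look something like:

root@ip-...:~# /lib64/libc.so.6 | head -n 1
GNU C Library (Ubuntu EGLIBC 2.13-0ubuntu13) stable release version 2.13, by Roland McGrath et al.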

/opt/sge6-fresh/util/arch needs to be patched as discussed in http://comments.gmane.org/gmane.comp.clustering.gridengine.users/21495.

Replace lines 64 to 66 with:

ossysname="`$UNAME -s`" 2>/dev/null || ossysname=unknown
osmachine="`$UNAME -m`" 2>/dev/null || osmachine=unknown
osrelease="`$UNAME -r`" 2>/dev/null || osrelease=unknown

Replace line 237 with:

libc=/lib64/libc.so.6

Replace line 240 with (optional):

libc=/lib/libc.so.6.1

Replace line 243 with (optional):

libc=/lib/libc.so.6

Insert a new line 247:

libc_string=`$libc | head -n 1`

Replace lines 253 and 254 with:

libc_version=`expr "$libc_string" : ".* version [0-9]*\.\([0-9]*\)" 2>/dev/null`
if [ $? -ne 0 -o $libc_version -lt 2 ]; then
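
With all the edits in place, the script should detect the platform instead of failing; on this instance it should print SGE's Linux/amd64 architecture string (e.g. lx24-amd64, matching the binary directory names under /opt/sge6-fresh):

root@ip-...:~# /opt/sge6-fresh/util/arch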

OpenMPI installation

Install the build dependencies for the libopenmpi-dev package:

root@ip-...:~# apt-get build-dep libopenmpi-dev

Get the source for the libopenmpi-dev Debian package:

root@ip-...:~# cd /usr/local/src
root@ip-...:/usr/local/src# mkdir openmpi
root@ip-...:/usr/local/src# cd openmpi
root@ip-...:/usr/local/src/openmpi# apt-get source libopenmpi-dev

Change into the libopenmpi-dev package's debian folder:

root@ip-...:/usr/local/src/openmpi# cd openmpi-1.4.3/debian

Modify the rules file and add --with-sge to the configure arguments on line 61. Use tabs, not spaces! It should end up looking something like this (the last two lines were modified):

COMMON_CONFIG_PARAMS = \
                      $(CROSS)                                \
                      $(CHKPT)                                \
                      $(NUMA)                                 \
                      --prefix=/usr                           \
                      --mandir=\$${prefix}/share/man          \
                      --infodir=\$${prefix}/share/info        \
                      --sysconfdir=/etc/openmpi               \
                      --libdir=\$${prefix}/lib/openmpi/lib    \
                      --includedir=\$${prefix}/lib/openmpi/include    \
                      --with-devel-headers \
                      --enable-heterogeneous \
                      $(TORQUE) \
                      --with-sge

Rebuild the libopenmpi-dev package:

root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3/debian# cd ..
root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3# dpkg-buildpackage -rfakeroot -b

Install the newly rebuilt package:

root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3# cd ..
root@ip-...:/usr/local/src/openmpi# dpkg -i *.deb

Verify Sun Grid Engine support:

root@ip-...:~# ompi_info | grep -i grid

should return:

MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

Check SGE and OpenMPI installation

Check SGE installation

First we need to make an AMI of the current instance (i-bd2fdedc below):

(starcluster)user@localhost:~$ starcluster ebsimage i-bd2fdedc test-sge-install -d 'Temporary AMI to test the SGE install'
StarCluster - (http://web.mit.edu/starcluster) (v. 0.92rc2)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Removing private data...
>>> Creating EBS image...
>>> Waiting for AMI ami-af11d5c6 to become available...
>>> create_image took 6.982 mins
>>> Your new AMI id is: ami-af11d5c6

Now let's check that it actually works. Add a new cluster configuration using the newly created AMI to ~/.starcluster/config, for example:

[cluster computecluster]
# change this to the name of one of the keypair sections defined above
KEYNAME = myssh_certificate
# number of ec2 instances to launch
CLUSTER_SIZE = 2
# create the following user on the cluster
CLUSTER_USER = sgeadmin
# optionally specify shell (defaults to bash)
CLUSTER_SHELL = bash
# AMI for cluster nodes.
NODE_IMAGE_ID = ami-af11d5c6
# instance type for all cluster nodes
NODE_INSTANCE_TYPE = cc1.4xlarge
# list of volumes to attach to the master node (OPTIONAL)
# these volumes, if any, will be NFS shared to the worker nodes
# VOLUMES = computeclusterhome

Then we can launch the cluster:

user@localhost:~$ starcluster spothistory -d 50 cc1.4xlarge
user@localhost:~$ starcluster start --login-master --bid=1.6 --cluster-template=computecluster testcluster

Log into the cluster with:

user@localhost:~$ starcluster sshmaster -u sgeadmin testcluster

and verify the installation by following Sun's procedure.
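
As a minimal smoke test (my shorthand, not Sun's full procedure):

sgeadmin@master:~$ qhost                 # all cluster nodes should be listed
sgeadmin@master:~$ echo hostname | qsub  # submit a trivial job read from stdin
sgeadmin@master:~$ qstat                 # the job should appear, run and leave the queue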

OpenMPI verification

On the testcluster created above, we compile and run a "Hello World" OpenMPI program:

sgeadmin@master:~$ mkdir mpi_test
sgeadmin@master:~$ cd mpi_test
sgeadmin@master:~/mpi_test$ vim mpi_hello.c

Write the following program source code as mpi_hello.c:

/*The Parallel Hello World Program*/
#include <stdio.h> /* printf and BUFSIZ defined there */
#include <stdlib.h> /* exit defined there */
#include <mpi.h> /* all MPI-2 functions defined there */

int main(int argc, char **argv)
{
  int rank, size, length;
  char name[BUFSIZ];

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &length);

  printf("%s: Hello World from process %d of %d\n", name, rank, size);
  MPI_Finalize();

  exit(0);
}

Compile it:

sgeadmin@master:~/mpi_test$ mpicc ./mpi_hello.c -o ./mpi_hello

And run it, first with mpirun directly, then through SGE:

sgeadmin@master:~/mpi_test$ mpirun -n 16 ./mpi_hello
sgeadmin@master:~/mpi_test$ qsub -b y -cwd -pe orte 24 mpirun ./mpi_hello

Check the output files in the submission directory (the job was submitted with -cwd, so they land in ~/mpi_test).
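
The job takes its name from the submitted command (mpirun here), so the files should look like this (the job id will differ):

sgeadmin@master:~/mpi_test$ cat mpirun.o*    # stdout: one Hello World line per process
sgeadmin@master:~/mpi_test$ cat mpirun.po*   # parallel environment output, if any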

Install ATLAS (and LAPACK)

Do a custom build; see README.Debian (linked here to a version newer than the one in Natty, which has the right info).

First get the build dependencies (devscripts was missing for some reason):

root@ip-...:~# apt-get build-dep atlas
root@ip-...:~# apt-get install devscripts

The package version in Natty (atlas 3.8.3-29) fails when trying to do a custom build; this is a confirmed package bug. I chose to backport the Oneiric package, 3.8.4-3 (Oneiric is not released as of 2011-07-18). It goes like this:

root@ip-...:~# cd /usr/local/src
root@ip-...:/usr/local/src# mkdir atlas
root@ip-...:/usr/local/src# cd atlas

root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4.orig.tar.bz2
root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4-3.debian.tar.gz
root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4-3.dsc

root@ip-...:/usr/local/src/atlas# dpkg-source -x atlas_3.8.4-3.dsc
root@ip-...:/usr/local/src/atlas# cd atlas-3.8.4
root@ip-...:/usr/local/src/atlas/atlas-3.8.4# fakeroot debian/rules custom
root@ip-...:/usr/local/src/atlas/atlas-3.8.4# cd ..
root@ip-...:/usr/local/src/atlas# dpkg -i *.deb

LAPACK was installed along with the dependencies.