Gridengine the Ubuntu/Debian way

I’ve had Sun GridEngine running on our cluster of 12-core HP blades from its earliest days. What has never worked is the inter-host communication (the ability of the system to schedule and distribute jobs across the nodes), so I set out to fix it. It turns out that the problems are mainly caused by quirks in the way the Debian (and, by inheritance, Ubuntu) packaging was done.

Prerequisites for gridengine: Most of the problems that I saw with the Debianised gridengine system are due to a lack of these prerequisites:

1. Check the hosts file (/etc/hosts) for localhost.localdomain-type entries. If these are present, they will cause host communication to fail. Ensure that, at minimum, the hosts file on the master has an entry for each exec node, and the hosts file on each exec node has an entry for the master. For example:

I will set up a cluster between my desktop machine, KWIAT22, and my laptop, caleb.
/etc/hosts on KWIAT22 contains:

127.0.0.1       localhost
#127.0.0.1      localhost.localdomain   localhost
129.67.46.129   KWIAT22
129.67.46.255   caleb

plus some other irrelevant entries. Note that localhost.localdomain is commented out.
/etc/hosts on caleb contains:

127.0.0.1       caleb
#127.0.0.1      localhost.localdomain   localhost
129.67.46.255   caleb
129.67.46.129   KWIAT22

Note again, the localhost.localdomain entry has been commented out.
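
The check described above can be scripted. The following is only a sketch: the function name find_localdomain_entries is made up for this example, and it simply lists any active (uncommented) lines in a hosts file that still mention localhost.localdomain.

```shell
# Print every active (uncommented) line of a hosts file that mentions
# localhost.localdomain. The function name is hypothetical, chosen for
# this example only. Defaults to /etc/hosts when no file is given.
find_localdomain_entries() {
    hosts_file="${1:-/etc/hosts}"
    # Drop comment lines first, then look for the offending name.
    grep -v '^[[:space:]]*#' "$hosts_file" | grep 'localhost\.localdomain'
}

# Example use (prints nothing and returns non-zero when the file is clean):
#   find_localdomain_entries /etc/hosts
# To see which address a host name resolves to on this machine:
#   getent hosts KWIAT22
```

If the function prints anything, comment those lines out as shown in the two hosts files above.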

2. Java is required for inter-host communication. We will use Sun Java, as it is assumed to be most compatible with Sun GridEngine. Edit /etc/apt/sources.list and uncomment the entries for the partner repository:

deb http://archive.canonical.com/ubuntu maverick partner
deb-src http://archive.canonical.com/ubuntu maverick partner

Then install the JRE:

apt-get install sun-java6-jre

Check which version of java we’ve got selected:

root@caleb:~# java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

From that we can see that I still have OpenJDK selected, so we change that:

root@caleb:~# update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                      Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-6-openjdk/jre/bin/java   1061      auto mode
  1            /usr/lib/jvm/java-6-openjdk/jre/bin/java   1061      manual mode
  2            /usr/lib/jvm/java-6-sun/jre/bin/java       63        manual mode

Press enter to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-6-sun/jre/bin/java to provide /usr/bin/java (java) in manual mode.
root@caleb:~# java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
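
If you have several nodes to set up, the same check and switch can be done without the interactive menu. This is a sketch under assumptions: java_flavour is a made-up helper name, and the Sun JRE path is the one shown in the menu above, which may differ on your system.

```shell
# Classify `java -version` output read from standard input.
# Prints "openjdk", "sun" or "unknown". The function name is hypothetical.
java_flavour() {
    version_text=$(cat)
    case "$version_text" in
        *OpenJDK*) echo openjdk ;;   # e.g. "OpenJDK Runtime Environment ..."
        *HotSpot*) echo sun ;;       # e.g. "Java HotSpot(TM) 64-Bit Server VM"
        *)         echo unknown ;;
    esac
}

# java -version writes to stderr, hence the redirect:
#   java -version 2>&1 | java_flavour
# Non-interactive equivalent of choosing entry 2 in the menu (run as root):
#   update-alternatives --set java /usr/lib/jvm/java-6-sun/jre/bin/java
```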

Now that we have these prerequisites satisfied, we can install the relevant gridengine packages. Installing gridengine on Ubuntu systems is made simple by the packages. We can install the packages on the master node (in our case KWIAT22):

apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master

Configure SGE automatically? Yes
SGE cell name: default
SGE master hostname: KWIAT22 (this should be the fully qualified domain name of the SGE master, not localhost)

Output will typically look something like this:

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  gridengine-common
The following NEW packages will be installed:
  gridengine-client gridengine-common gridengine-exec gridengine-master gridengine-qmon
0 upgraded, 5 newly installed, 0 to remove and 37 not upgraded.
Need to get 0 B/18.7 MB of archives.
After this operation, 44.8 MB of additional disk space will be used.
Do you want to continue [Y/n]?
Preconfiguring packages ...
Selecting previously deselected package gridengine-common.
(Reading database ... 372804 files and directories currently installed.)
Unpacking gridengine-common (from .../gridengine-common_6.2u5-1ubuntu1_all.deb) ...
Selecting previously deselected package gridengine-client.
Unpacking gridengine-client (from .../gridengine-client_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-exec.
Unpacking gridengine-exec (from .../gridengine-exec_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-master.
Unpacking gridengine-master (from .../gridengine-master_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-qmon.
Unpacking gridengine-qmon (from .../gridengine-qmon_6.2u5-1ubuntu1_amd64.deb) ...
Processing triggers for man-db ...
Processing triggers for ureadahead ...
Setting up gridengine-common (6.2u5-1ubuntu1) ...

Creating config file /etc/default/gridengine with new version
Setting up gridengine-client (6.2u5-1ubuntu1) ...
Setting up gridengine-exec (6.2u5-1ubuntu1) ...
error: communication error for "KWIAT22/execd/1" running on port 6445: "can't bind socket"
error: commlib error: can't bind socket (no additional information available)
..........................
critical error: abort qmaster registration due to communication errors
daemonize error: child exited before sending daemonize state
Setting up gridengine-master (6.2u5-1ubuntu1) ...
Initializing cluster with the following parameters:
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: default
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin
Initializing spool (/var/spool/gridengine/spooldb)
Initializing global configuration based on /usr/share/gridengine/default-configuration
Initializing complexes based on /usr/share/gridengine/centry
Initializing usersets based on /usr/share/gridengine/usersets
Adding user sgeadmin as a manager
Cluster creation complete
Setting up gridengine-qmon (6.2u5-1ubuntu1) ...

Note that the execd cannot bind the socket. This occurs because of a left-over execd that failed to stop from a previous install. It also happens if you don’t have java installed, as the execd won’t respond to /etc/init.d/gridengine-exec stop without java. Also, if you run apt-get purge gridengine-* to get back to a clean slate, the execd will typically not be stopped properly, despite being removed from the system. To fix this, find the stale sge_execd process, kill it, and restart the services:

root@KWIAT22:~# ps aux |grep sge
sgeadmin 22244 0.0 0.0 135172 4940 ? Sl 17:42 0:00 /usr/lib/gridengine/sge_qmaster
sgeadmin 24272 0.0 0.0  58688 2500 ? Sl May16 0:22 /usr/lib/gridengine/sge_execd
root@KWIAT22:~# kill 24272
root@KWIAT22:~# /etc/init.d/gridengine-exec start
root@KWIAT22:~# /etc/init.d/gridengine-master restart
 * Restarting Sun Grid Engine Master Scheduler sge_qmaster
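
Picking the PID out of the ps listing by eye gets tedious across many nodes. As a rough sketch (execd_pids is a made-up name), the following pulls the PID of any sge_execd process out of ps aux output, assuming the usual layout where the PID is the second column and the command the eleventh:

```shell
# Read `ps aux` output on standard input and print the PID of every
# sge_execd process. Assumes the standard `ps aux` column layout:
# PID in column 2, command path in column 11. Name is hypothetical.
execd_pids() {
    awk '$11 ~ /\/sge_execd$/ { print $2 }'
}

# Example use:
#   ps aux | execd_pids
# To check whether something is still listening on the execd port (as root):
#   netstat -tlnp | grep 6445
```

Once you have the PID, kill it and restart the services as shown above.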

The logfiles we can use for tracking down problems in communication between the qmaster and execd processes are not in the standard Debian/Ubuntu locations. Instead, they are stored in /var/spool/gridengine/qmaster/messages for the qmaster and /tmp/execd_messages.[pid] or /var/spool/gridengine/execd/messages for the execd processes. The log messages for our previous socket problem look like this (/tmp/execd_messages.24107):

05/16/2011 20:17:16|  main|KWIAT22|E|communication error for "KWIAT22/execd/1" running on port 6445: "can't bind socket"
05/16/2011 20:17:17|  main|KWIAT22|E|commlib error: can't bind socket (no additional information available)
05/16/2011 20:17:45|  main|KWIAT22|C|abort qmaster registration due to communication errors
05/16/2011 20:17:47|  main|KWIAT22|W|daemonize error: child exited before sending daemonize state

If you see any lines containing |E| then you have an error that must be addressed. Any lines with |W| are warnings, and it’s probably wise to fix those too.
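
To pull just the serious lines out of a messages file, something like the following sketch works. The function name sge_log_problems is made up; it relies only on the layout visible in the sample above, where the severity letter is the fourth |-separated field. It also reports |C| (critical) lines, which appear in the sample alongside |E|.

```shell
# Print the error (E) and critical (C) lines from an SGE messages file.
# The severity letter is the fourth '|'-separated field, as in the log
# excerpt above. The function name is hypothetical.
sge_log_problems() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$1"
}

# Example use:
#   sge_log_problems /var/spool/gridengine/execd/messages
#   for f in /tmp/execd_messages.*; do sge_log_problems "$f"; done
```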

On the exec nodes:

apt-get install gridengine-exec

Configure SGE automatically? yes
SGE cell name: default
SGE master hostname: KWIAT22

After installing, you will see the following error in the /tmp/execd_messages.[pid] file and the process will exit:

05/18/2011 17:53:00|  main|caleb|E|getting configuration: denied: host "caleb" is neither submit nor admin host
05/18/2011 17:53:05|  main|caleb|C|can't get configuration qmaster - terminating

This occurs because the master doesn’t yet know about the exec node. We need to set up a basic configuration on the master. We will use the documentation in /usr/share/doc/gridengine-common/README.Debian, which I will duplicate here, to form the basis of our configuration:

Once you've installed SGE, you'll need to do at least some minimal
cluster configuration.

Quickstart
==========

 * Install gridengine-master, gridengine-exec and gridengine-client
   on the appropriate hosts.

 * Initially, only the sgeadmin user has admin privileges

 * It is suggested that you add yourself as a manager and
   perform the rest of these tasks as your own user:
   + sudo -u sgeadmin qconf -am myuser

 * and to a userlist:
   + qconf -au myuser users

 * Add a submission host:
   + qconf -as myhost.mydomain

 * Add an execution host:
   + qconf -ae
   You will now be prompted for information about the execution host.

 * Add a new host group:
   + qconf -ahgrp @allhosts

 * Add the exec host to the @allhosts list:
   + qconf -aattr hostgroup hostlist myhost.mydomain @allhosts

 * Add a queue:
   + qconf -aq main.q

 * Add the host group to the queue:
   + qconf -aattr queue hostlist @allhosts main.q

 * Make sure there is a slot allocated to the execd:
   + qconf -aattr queue slots "[myhost.mydomain=1]" main.q

 * Running qstat -f should then show you the execd waiting for jobs

The commands that I ran in my example:

sudo su
sudo -u sgeadmin qconf -am rwh
exit
qconf -au rwh users
qconf -as KWIAT22
qconf -ahgrp @allhosts  # just save the file without modifying it
qconf -aattr hostgroup hostlist KWIAT22 @allhosts
qconf -aq main.q # just save the file without modifying it
qconf -aattr queue hostlist @allhosts main.q
qconf -aattr queue slots "4, [KWIAT22=3]" main.q # 4 by default for all nodes, 3 specifically for KWIAT22, which leaves 1 of the 4 cpus free for the master process

We then add caleb as a submit and exec host:

qconf -as caleb
qconf -ae # change the hostname entry to caleb
qconf -aattr hostgroup hostlist caleb @allhosts
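
If you have many exec nodes, editing the qconf -ae template by hand for each one gets tedious. qconf also accepts the definition from a file with -Ae. The sketch below generates such a file. The function name exec_host_definition is made up, and the field list is my assumption of what the qconf -ae template contains on 6.2u5, so compare it against the output of qconf -se on your own version before relying on it.

```shell
# Emit an execution host definition for the given host name, suitable
# for `qconf -Ae <file>`. The field list is an assumption based on the
# 6.2u5 template; check it against `qconf -se <host>` on your system.
# The function name is hypothetical.
exec_host_definition() {
    printf '%s\n' \
        "hostname              $1" \
        'load_scaling          NONE' \
        'complex_values        NONE' \
        'user_lists            NONE' \
        'xuser_lists           NONE' \
        'projects              NONE' \
        'xprojects             NONE' \
        'usage_scaling         NONE' \
        'report_variables      NONE'
}

# Example use on the master (file name is arbitrary):
#   exec_host_definition caleb > /tmp/caleb.host
#   qconf -Ae /tmp/caleb.host
```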

Once this is done, we need to start the execd on caleb:

/etc/init.d/gridengine-exec start

Check that it doesn’t create a log file in /tmp/execd_messages.[pid]. If it doesn’t then it’s happy! Back on our master node, a qstat -f should now show us all set up. You can use the GUI qmon tool to get a better look at the setup. To use qmon, you must ssh to the master node with X11 forwarding enabled:

ssh -X hostname
qmon

Click the queue control button and then the Hosts tab. If the exec nodes are communicating properly with the master, you should see them listed there, and they should NOT have dashes for the information columns. If a node does show dashes, it’s not communicating correctly, and you’ll need to go look in the log files for the reason. Note that if java is not installed, the communication between nodes will not work, and this may or may not show up in the log files.

(Screenshot: failure to communicate with the exec node, caleb.)

(Screenshot: success communicating with the exec node, caleb, after installing and selecting Sun Java.)

Next, we need to set up a parallel environment. This will allow gridengine to start processes on the remote exec nodes. We can do this with qmon, though it’s also possible with the CLI tools. In qmon, click the bottom-left button, Parallel Environment Configuration, then click Add. In our example, we’re setting up the simplest form of parallel environment, which doesn’t include any message-passing functionality. Set the Name to simple_pe. We have two 4-core machines with one core reserved for the master process, so set Slots to 7 (2 x 4 - 1). Leave the rest at their default values, click OK, then click Done. Now click Cluster Queues, the second button from the left in the top row. On the Cluster Queues tab, click main.q, then click Modify. Click the Parallel Environment tab, then click simple_pe and move it over to the Referenced PEs box. Click OK, then Done.
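
For those who prefer the CLI route, here is a rough sketch of the same setup. The helper names pe_slot_count and pe_definition are made up, and the field values are my assumption of the defaults that qmon fills in on 6.2u5; compare them with the output of qconf -sp simple_pe on a PE created through qmon before trusting them.

```shell
# Slots for the PE: hosts x cores per host, minus the cores held back.
# Arguments: number_of_hosts cores_per_host reserved_cores
pe_slot_count() {
    echo $(( $1 * $2 - $3 ))
}

# Emit a parallel environment definition suitable for `qconf -Ap <file>`.
# Arguments: pe_name slots. Everything else is an assumed default.
# Both function names are hypothetical.
pe_definition() {
    printf '%s\n' \
        "pe_name            $1" \
        "slots              $2" \
        'user_lists         NONE' \
        'xuser_lists        NONE' \
        'start_proc_args    /bin/true' \
        'stop_proc_args     /bin/true' \
        'allocation_rule    $pe_slots' \
        'control_slaves     FALSE' \
        'job_is_first_task  TRUE' \
        'urgency_slots      min' \
        'accounting_summary FALSE'
}

# Example use on the master (file and script names are placeholders):
#   pe_definition simple_pe "$(pe_slot_count 2 4 1)" > /tmp/simple_pe.conf
#   qconf -Ap /tmp/simple_pe.conf
#   qconf -aattr queue pe_list simple_pe main.q
#   qsub -pe simple_pe 2 myjob.sh
```

The last qconf line mirrors the qmon step of moving simple_pe into the queue's list of referenced PEs.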

Lastly, you need to set up passwordless ssh access from the master node to the exec nodes for the users of the gridengine system. This is left as an exercise for the reader, but you might start with learning about OpenSSH key management.
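
As a starting point, the usual OpenSSH tools for this are ssh-keygen (to create a key pair on the master) and ssh-copy-id (to install the public key on each exec node). The sketch below shows, under my own assumptions, roughly what the second step amounts to on the exec node: appending the public key to authorized_keys once, with the permissions sshd expects. The function name authorize_key is made up.

```shell
# Append a public key to an authorized_keys file unless it is already
# there, and tighten permissions. Note that it sets mode 700 on the
# directory holding the authorized_keys file, so only point it at a
# .ssh directory. The function name is hypothetical.
# Arguments: public_key_file [authorized_keys_file]
authorize_key() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # Only append when no identical line is present yet.
    if ! grep -qxF -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# The standard tools do the same job from the master, per user:
#   ssh-keygen -t rsa
#   ssh-copy-id caleb
# Afterwards `ssh caleb hostname` should run without a password prompt.
```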


12 Responses to Gridengine the Ubuntu/Debian way

  1. David says:

    Sir,

    Where can I get the Grid Engine packages?

  2. rwh says:

    Hi David,

    You simply install them from the universe repository, with apt-get install. Have a look in /etc/apt/sources.list and make sure that there’s a line ending in universe. They start off commented out. I understand you can set them up by configuring the software sources through synaptic or update manager as well. Here are my lines:

    ## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
    ## team. Also, please note that software in universe WILL NOT receive any
    ## review or updates from the Ubuntu security team.
    deb http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty universe
    deb-src http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty universe
    deb http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty-updates universe
    deb-src http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty-updates universe

    Once you have your sources set up properly, you can find the installation commands in my blog post.

    Regards,

    Rob

  3. mohammad naquiddin abd razak says:

    David,

    thanks a lot for posting this. It works perfectly fine on my machines. However, I sometimes hit the failure to communicate with the exec node; I had this problem on 8 of my 12 machines.
    I then removed the package and reinstalled it. It worked perfectly fine after that, but I wonder why it is so difficult to get the exec node working on the first installation. I have repeated the installation procedure twice, and sometimes more than that. I would love to share the steps below. They worked fine on my machines 🙂

    VANDERBILT\abdrazm@kyuss:~$ sudo apt-get update
    VANDERBILT\abdrazm@masi-7:~$ sudo apt-get install openjdk-6-jre
    VANDERBILT\abdrazm@kyuss:~$ sudo apt-get --purge remove gridengine-common gridengine-exec gridengine-client
    VANDERBILT\abdrazm@kyuss:~$ sudo ps aux |grep sge
    VANDERBILT\abdrazm@kyuss:~$ sudo kill ……………..
    VANDERBILT\abdrazm@kyuss:~$ sudo apt-get install gridengine-exec gridengine-client

    Regards,
    Din

  4. mohammad naquiddin abd razak says:

    Sorry, the earlier comment was intended for the owner of this website (Rob?). Anyway, thanks again.

  5. rwh says:

    Hi Din,

    Not sure what’s going on there, I suspect it’s something to do with some side-effect of restarting the services, or perhaps slightly different configuration on subsequent installs?

    Cheers,

    Rob

  6. Benson Margulies says:

    With ubuntu 11.04, no amount of configuring fixes commlib errors related to localhost.

    With the following ‘/etc/hosts’, and act_qmaster containing ciderpress.basistech.net, all q commands fail with a commlib error whinging about ‘localhost’. I got around it by setting act_qmaster to contain ‘localhost’ since I only have one machine for now.

    127.0.0.1 ciderpress.basistech.net localhost
    127.0.1.1 ciderpress.basistech.net ciderpress

  7. mrcaq says:

    hello there,

    I’ve installed SGE by following the instructions above, and it works fine when I submit jobs using qsub/qrsh. However, I have a problem with qsh and qlogin. Do I need to fix this, given that qsub already works perfectly?

    If I should fix it, where could I find a link to get started?

  8. rwh says:

    mrcaq: You’ll need to be more specific about what the problem is. Do you get some kind of error message?

  9. mrcaq says:

    ~$ qsh
    Your job 1035 ("INTERACTIVE") has been submitted
    waiting for interactive job to be scheduled ...
    Could not start interactive job.

    ~$ qlogin
    local configuration *mrcaqcom* not defined - using global configuration
    Your job 1036 ("QLOGIN") has been submitted
    waiting for interactive job to be scheduled ...
    Your interactive job 1036 has been successfully scheduled.
    Establishing /usr/share/gridengine/qlogin-wrapper session to host *mrcaqhost*
    error: Could not exec /usr/share/gridengine/qlogin-wrapper: No such file or directory
    /usr/share/gridengine/qlogin-wrapper exited with exit code 1

  10. ryan says:

    THANK YOU for posting this guide.

  11. Rauf says:

    Thanks for the very good tutorial.
    Please note that different versions of Ubuntu ship different versions of the gridengine* packages, and this difference can also cause errors when communicating with exec nodes.
    Ubuntu 10.04 has version 6.2u4-2ubuntu1
    While 10.10’s version is 6.2u5-2ubuntu1

  12. Verahill says:

    Rob,
    thank you for the gridengine how-to. I spent some time a couple of months trying to figure out SGE on Debian but gave up and ended up writing my own (very basic) queue manager instead. Looking at your guide all the pieces finally fell into place and I now have a very cool little cluster.
    Cheers!
