I’ve had Sun GridEngine running on our cluster of 12-core HP blades from its earliest days. What has not been working is the inter-host communication (the ability of the system to schedule and distribute jobs across the nodes), so I set out to fix it. It turns out that the problems are mainly caused by quirks in the way the Debian (and, by inheritance, Ubuntu) packaging was done.
Prerequisites for gridengine: Most of the problems that I saw with the Debianised gridengine system are due to a lack of these prerequisites:
1. Check the hosts file for localhost.localdomain type entries. If these are present, they will cause host communication to fail. Ensure that, at minimum, the hosts file on the master has an entry for each exec node, and the hosts file on each exec node has an entry for the master. For example:
I will set up a cluster between my desktop machine, KWIAT22, and my laptop, caleb.
/etc/hosts on KWIAT22 contains:
127.0.0.1 localhost
#127.0.0.1 localhost.localdomain localhost
129.67.46.129 KWIAT22
129.67.46.255 caleb
plus some other irrelevant entries. Note that localhost.localdomain is commented out.
/etc/hosts on caleb contains:
127.0.0.1 caleb
#127.0.0.1 localhost.localdomain localhost
129.67.46.255 caleb
129.67.46.129 KWIAT22
Note again, the localhost.localdomain entry has been commented out.
2. Java is required for inter-host communication. We will use Sun Java, as it is assumed to be most compatible with Sun GridEngine. Edit /etc/apt/sources.list and uncomment the entries for the partner repository:
deb http://archive.canonical.com/ubuntu maverick partner
deb-src http://archive.canonical.com/ubuntu maverick partner
Then install the JRE:
apt-get install sun-java6-jre
Check which version of java we’ve got selected:
root@caleb:~# java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
From that we can see that I still have OpenJDK selected, so we change that:
root@caleb:~# update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                       Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-6-openjdk/jre/bin/java   1061       auto mode
  1            /usr/lib/jvm/java-6-openjdk/jre/bin/java   1061       manual mode
  2            /usr/lib/jvm/java-6-sun/jre/bin/java       63         manual mode

Press enter to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-6-sun/jre/bin/java to provide /usr/bin/java (java) in manual mode.
root@caleb:~# java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
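Both prerequisites can be sanity-checked from a script before installing anything. The following is only a rough sketch: the function names find_localdomain_entries and jvm_flavour are my own inventions, not part of any gridengine package, and the Java check simply looks for the words "OpenJDK" or "HotSpot" in the version banner, as seen in the output above.

```shell
# Hypothetical helpers for sanity-checking the two prerequisites.

# Print active (uncommented) lines in a hosts file that mention
# localhost.localdomain. Prints nothing if the file is clean.
find_localdomain_entries() {
    hosts_file="${1:-/etc/hosts}"
    # Strip comments (everything from '#' onward), then search what is left.
    sed 's/#.*//' "$hosts_file" | grep 'localhost\.localdomain'
}

# Classify a `java -version` banner read from stdin.
# Assumption: OpenJDK banners contain "OpenJDK", Sun banners contain "HotSpot".
jvm_flavour() {
    banner=$(cat)
    case "$banner" in
        *OpenJDK*) echo "openjdk" ;;
        *HotSpot*) echo "sun" ;;
        *)         echo "unknown" ;;
    esac
}
```

Run find_localdomain_entries /etc/hosts on each node; any output is a line that still needs commenting out. Since java prints its banner on stderr, the second helper is used as java -version 2>&1 | jvm_flavour.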
Now that we have these prerequisites satisfied, we can install the relevant gridengine packages. Installing gridengine on Ubuntu systems is made simple by the packages. We can install the packages on the master node (in our case KWIAT22):
apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
Configure SGE automatically? Yes
SGE cell name: default
SGE master hostname: KWIAT22 (this should be the fully qualified domain name of the SGE master, not localhost)
Output will typically look something like this:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  gridengine-common
The following NEW packages will be installed:
  gridengine-client gridengine-common gridengine-exec gridengine-master gridengine-qmon
0 upgraded, 5 newly installed, 0 to remove and 37 not upgraded.
Need to get 0 B/18.7 MB of archives.
After this operation, 44.8 MB of additional disk space will be used.
Do you want to continue [Y/n]?
Preconfiguring packages ...
Selecting previously deselected package gridengine-common.
(Reading database ... 372804 files and directories currently installed.)
Unpacking gridengine-common (from .../gridengine-common_6.2u5-1ubuntu1_all.deb) ...
Selecting previously deselected package gridengine-client.
Unpacking gridengine-client (from .../gridengine-client_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-exec.
Unpacking gridengine-exec (from .../gridengine-exec_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-master.
Unpacking gridengine-master (from .../gridengine-master_6.2u5-1ubuntu1_amd64.deb) ...
Selecting previously deselected package gridengine-qmon.
Unpacking gridengine-qmon (from .../gridengine-qmon_6.2u5-1ubuntu1_amd64.deb) ...
Processing triggers for man-db ...
Processing triggers for ureadahead ...
Setting up gridengine-common (6.2u5-1ubuntu1) ...
Creating config file /etc/default/gridengine with new version
Setting up gridengine-client (6.2u5-1ubuntu1) ...
Setting up gridengine-exec (6.2u5-1ubuntu1) ...
error: communication error for "KWIAT22/execd/1" running on port 6445: "can't bind socket"
error: commlib error: can't bind socket (no additional information available)
..........................
critical error: abort qmaster registration due to communication errors
daemonize error: child exited before sending daemonize state
Setting up gridengine-master (6.2u5-1ubuntu1) ...
Initializing cluster with the following parameters:
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: default
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin
Initializing spool (/var/spool/gridengine/spooldb)
Initializing global configuration based on /usr/share/gridengine/default-configuration
Initializing complexes based on /usr/share/gridengine/centry
Initializing usersets based on /usr/share/gridengine/usersets
Adding user sgeadmin as a manager
Cluster creation complete
Setting up gridengine-qmon (6.2u5-1ubuntu1) ...
Note that the execd cannot bind the socket. This occurs because of a left-over execd that failed to stop from a previous install. It also results if you don’t have java installed, as the execd won’t respond to /etc/init.d/gridengine-exec stop without java. Also, if you’re doing an apt-get purge gridengine-* to get back to a fresh slate, typically the execd will not be stopped properly, despite being removed from the system. This can be fixed by:
root@KWIAT22:~# ps aux |grep sge
sgeadmin 22244 0.0 0.0 135172 4940 ? Sl 17:42 0:00 /usr/lib/gridengine/sge_qmaster
sgeadmin 24272 0.0 0.0 58688 2500 ? Sl May16 0:22 /usr/lib/gridengine/sge_execd
root@KWIAT22:~# kill 24272
root@KWIAT22:~# /etc/init.d/gridengine-exec start
root@KWIAT22:~# /etc/init.d/gridengine-master restart
 * Restarting Sun Grid Engine Master Scheduler sge_qmaster
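If you have to do this on several nodes, picking the PID out by eye gets tedious. Here is a small sketch of how the lookup might be scripted; stale_execd_pids is a made-up name, and it assumes the usual ps aux layout where the PID is the second column.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print the PID
# (second column) of every sge_execd process, skipping the awk filter itself.
stale_execd_pids() {
    awk '/sge_execd/ && !/awk/ { print $2 }'
}
```

Use it as ps aux | stale_execd_pids and inspect the result before passing anything to kill.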
The logfiles we can use for tracking down problems in communication between the qmaster and execd processes are not in the standard Debian/Ubuntu locations. Instead, they are stored in /var/spool/gridengine/qmaster/messages for the qmaster and /tmp/execd_messages.[pid] or /var/spool/gridengine/execd/messages for the execd processes. The log messages for our previous socket problem look like this (/tmp/execd_messages.24107):
05/16/2011 20:17:16| main|KWIAT22|E|communication error for "KWIAT22/execd/1" running on port 6445: "can't bind socket"
05/16/2011 20:17:17| main|KWIAT22|E|commlib error: can't bind socket (no additional information available)
05/16/2011 20:17:45| main|KWIAT22|C|abort qmaster registration due to communication errors
05/16/2011 20:17:47| main|KWIAT22|W|daemonize error: child exited before sending daemonize state
If you see any lines containing |E| then you have an error that must be addressed. Any lines with |W| are warnings, and it’s probably wise to fix those too.
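Pulling the interesting lines out of a messages file can be done with a single grep. This is only a sketch: sge_problems is a hypothetical name, and I have included |C| alongside |E| and |W| because the sample log above marks its critical line that way.

```shell
# Hypothetical helper: print only the error (|E|), critical (|C|) and
# warning (|W|) lines from a gridengine messages file given as $1.
sge_problems() {
    grep -E '\|[ECW]\|' "$1"
}
```

For example, sge_problems /tmp/execd_messages.24107 would print all four lines of the log shown above.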
On the exec nodes:
apt-get install gridengine-exec
Configure SGE automatically? yes
SGE cell name: default
SGE master hostname: KWIAT22
After installing, you will see the following error in the /tmp/execd_messages.[pid] file and the process will exit:
05/18/2011 17:53:00| main|caleb|E|getting configuration: denied: host "caleb" is neither submit nor admin host
05/18/2011 17:53:05| main|caleb|C|can't get configuration qmaster - terminating
This occurs because the master doesn’t yet know about the exec node. We need to set up a basic configuration on the master. We will use the documentation in /usr/share/doc/gridengine-common/README.Debian, which I will duplicate here, to form the basis of our configuration:
Once you've installed SGE, you'll need to do at least some minimal cluster configuration.

Quickstart
==========

* Install gridengine-master, gridengine-exec and gridengine-client on the appropriate hosts.
* Initially, only the sgeadmin user has admin privileges
* It is suggested that you add yourself as a manager and perform the rest of these tasks as your own user:
  + sudo -u sgeadmin qconf -am myuser
* and to a userlist:
  + qconf -au myuser users
* Add a submission host:
  + qconf -as myhost.mydomain
* Add an execution host:
  + qconf -ae
  You will now be prompted for information about the execution host.
* Add a new host group:
  + qconf -ahgrp @allhosts
* Add the exec host to the @allhosts list:
  + qconf -aattr hostgroup hostlist myhost.mydomain @allhosts
* Add a queue:
  + qconf -aq main.q
* Add the host group to the queue:
  + qconf -aattr queue hostlist @allhosts main.q
* Make sure there is a slot allocated to the execd:
  + qconf -aattr queue slots "[myhost.mydomain=1]" main.q
* Running qstat -f should then show you the execd waiting for jobs
The commands that I ran in my example:
sudo su
sudo -u sgeadmin qconf -am rwh
exit
qconf -au rwh users
qconf -as KWIAT22
qconf -ahgrp @allhosts  # just save the file without modifying it
qconf -aattr hostgroup hostlist KWIAT22 @allhosts
qconf -aq main.q  # just save the file without modifying it
qconf -aattr queue hostlist @allhosts main.q
qconf -aattr queue slots "4, [KWIAT22=3]" main.q  # 4 by default for all nodes, 3 specifically for KWIAT22, which leaves 1 of the 4 cpus free for the master process
We then add caleb as a submit and exec host:
qconf -as caleb
qconf -ae  # change the hostname entry to caleb
qconf -aattr hostgroup hostlist caleb @allhosts
Once this is done, we need to start the execd on caleb:
/etc/init.d/gridengine-exec start
Check that it doesn’t create a log file in /tmp/execd_messages.[pid]. If it doesn’t then it’s happy! Back on our master node, a qstat -f should now show us all set up. You can use the GUI qmon tool to get a better look at the setup. To use qmon, you must ssh to the master node with X11 forwarding enabled:
ssh -X hostname qmon
Click the queue control button and then the Hosts tab. If the exec nodes are communicating properly with the master, you should see them listed there, and they should NOT have dashes for the information columns. If a node does show dashes, it’s not communicating correctly, and you’ll need to go look in the log files for the reason. Note that if java is not installed, the communication between nodes will not work, and this may or may not show up in the log files.
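The "no log file in /tmp means it's happy" check on an exec node can also be scripted, which helps when there are many nodes to go through. A minimal sketch, with execd_startup_logs being a name I made up; it takes the directory as an optional argument and defaults to /tmp.

```shell
# Hypothetical helper: list execd startup logs left in a directory
# (default /tmp). No output suggests the execd started cleanly.
execd_startup_logs() {
    dir="${1:-/tmp}"
    for f in "$dir"/execd_messages.*; do
        # With no match the glob stays literal, so test that the file exists.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
}
```

Run execd_startup_logs on the exec node after starting the execd; any file it lists is worth reading for |E| lines.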
Next, we need to set up a parallel environment. This will allow gridengine to start processes on the remote exec nodes. We can do this with qmon, though it’s also possible with the CLI tools. In qmon, click the bottom left button, Parallel Environment Configuration. Click Add. In our example, we’re setting up the simplest form of parallel environment, which doesn’t include any message passing functionality. Set the Name to simple_pe. In our case, we have two 4-core machines, with one core reserved for the master process, so we set the number of slots to 7. The rest we leave as default values: just click OK, then click Done. Now click the Cluster Queues button (top row, second from the left). On the Cluster Queues tab, click main.q, then click Modify. Click the Parallel Environment tab, then click simple_pe and move it over to the referenced PEs box. Click OK, and Done.
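For those who prefer the CLI route, here is a rough sketch of the same setup. The pe_slots function is a hypothetical helper that just does the slot arithmetic described above; the two qconf commands are left as comments because they need a running qmaster, and qconf -ap opens an editor in the same way qconf -aq does.

```shell
# Hypothetical helper: slots available to the parallel environment, i.e.
# every core in the cluster minus those reserved for the master process.
# Arguments: number of hosts, cores per host, reserved cores.
pe_slots() {
    hosts=$1
    cores_per_host=$2
    reserved=$3
    echo $(( hosts * cores_per_host - reserved ))
}

# CLI equivalent of the qmon steps (not executed here):
#   qconf -ap simple_pe                          # opens an editor; set "slots" to the value above
#   qconf -aattr queue pe_list simple_pe main.q  # reference the PE from main.q
```

For the two 4-core machines in this example, pe_slots 2 4 1 gives the 7 slots used above.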
Lastly, you need to set up passwordless ssh access from the master node to the exec nodes for the users of the gridengine system. This is left as an exercise for the reader, but you might start with learning about OpenSSH key management.
Sir,
Where can I get the Grid Engine packages?
Hi David,
You simply install them from the universe repository, with apt-get install. Have a look in /etc/apt/sources.list and make sure that there’s a line ending in universe. They start off commented out. I understand you can set them up by configuring the software sources through synaptic or update manager as well. Here are my lines:
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## team. Also, please note that software in universe WILL NOT receive any
## review or updates from the Ubuntu security team.
deb http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty universe
deb-src http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty universe
deb http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty-updates universe
deb-src http://mirror.ox.ac.uk/sites/archive.ubuntu.com/ubuntu/ natty-updates universe
Once you have your sources set up properly, you can find the installation commands in my blog post.
Regards,
Rob
David,
thanks a lot for posting this. It works perfectly fine on my machines. However, sometimes I got caught by a failure to communicate with the exec node; I had this problem on 8 out of the 12 machines that I had.
Then I removed the package and reinstalled it. It worked perfectly fine after that, but I just wonder why it is so difficult to get the exec node working on the first installation. I’ve had to repeat the installation procedure twice, and sometimes more than that. I would love to share the steps, as below. They worked fine on my machines 🙂
VANDERBILT\abdrazm@kyuss:~$ sudo apt-get update
VANDERBILT\abdrazm@masi-7:~$ sudo apt-get install openjdk-6-jre
VANDERBILT\abdrazm@kyuss:~$ sudo apt-get --purge remove gridengine-common gridengine-exec gridengine-client
VANDERBILT\abdrazm@kyuss:~$ sudo ps aux |grep sge
VANDERBILT\abdrazm@kyuss:~$ sudo kill ……………..
VANDERBILT\abdrazm@kyuss:~$ sudo apt-get install gridengine-exec gridengine-client
Regards,
Din
sorry, the comment earlier was intended to the owner of this website (Rob?). anyway, thanks again
Hi Din,
Not sure what’s going on there, I suspect it’s something to do with some side-effect of restarting the services, or perhaps slightly different configuration on subsequent installs?
Cheers,
Rob
With ubuntu 11.04, no amount of configuring fixes commlib errors related to localhost.
With the following ‘/etc/hosts’, and act_qmaster containing ciderpress.basistech.net, all q commands fail with a commlib error whinging about ‘localhost’. I got around it by setting act_qmaster to contain ‘localhost’ since I only have one machine for now.
127.0.0.1 ciderpress.basistech.net localhost
127.0.1.1 ciderpress.basistech.net ciderpress
hello there,
I’ve installed SGE by following the above instructions and it works fine when I submit jobs using qsub/qrsh. However, I’ve got a problem with qsh and qlogin. Is it necessary to fix this, given that qsub already works perfectly fine?
If I should fix it, where could I find a link to get started?
mrcaq: You’ll need to be more specific about what the problem is. Do you get some kind of error message?
~$ qsh
Your job 1035 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Could not start interactive job.
~$ qlogin
local configuration *mrcaqcom* not defined - using global configuration
Your job 1036 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1036 has been successfully scheduled.
Establishing /usr/share/gridengine/qlogin-wrapper session to host *mrcaqhost*
error: Could not exec /usr/share/gridengine/qlogin-wrapper: No such file or directory
/usr/share/gridengine/qlogin-wrapper exited with exit code 1
THANK YOU for posting this guide.
Thanks for very good tutorial
Please note that different versions of Ubuntu ship different versions of the gridengine* packages, and this difference can also cause errors when communicating with exec nodes.
Ubuntu 10.04 has version 6.2u4-2ubuntu1
While 10.10’s version is 6.2u5-2ubuntu1
Rob,
thank you for the gridengine how-to. I spent some time a couple of months trying to figure out SGE on Debian but gave up and ended up writing my own (very basic) queue manager instead. Looking at your guide all the pieces finally fell into place and I now have a very cool little cluster.
Cheers!
I am having a problem installing SGE on Ubuntu 8.04. Please help me out.