{"id":159,"date":"2011-06-01T22:39:49","date_gmt":"2011-06-01T12:39:49","guid":{"rendered":"http:\/\/helms-deep.net\/~rwh\/blog\/?p=159"},"modified":"2011-06-02T23:33:59","modified_gmt":"2011-06-02T13:33:59","slug":"gridengine-on-ubuntu-11-04","status":"publish","type":"post","link":"https:\/\/helms-deep.net\/~rwh\/blog\/?p=159","title":{"rendered":"Gridengine the Ubuntu\/Debian way"},"content":{"rendered":"<p>I&#8217;ve had Sun GridEngine running on our cluster of 12-core HP blades from its earliest days. What has not been working is the inter-host communication (the ability of the system to schedule and distribute jobs across the nodes). I therefore set out to fix this situation. It turns out that the problems that prevented this from working are mainly caused by quirks in the way that the Debian (and by inheritance, Ubuntu) packaging was done.<!--more--><\/p>\n<p>Prerequisites for gridengine: Most of the problems that I saw with the Debianised gridengine system are due to a lack of these prerequisites:<\/p>\n<p><strong>1. Check the hosts file for localhost.localdomain type entries.<\/strong> If these are present, they will cause host communication to fail. Ensure that, at minimum, the hosts file on the master has an entry for each exec node, and the hosts file on each exec node has an entry for the master. For example:<\/p>\n<p>I will set up a cluster between my desktop machine, KWIAT22, and my laptop, caleb.<br \/>\n\/etc\/hosts on KWIAT22 contains:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">127.0.0.1       localhost\r\n#127.0.0.1      localhost.localdomain   localhost\r\n129.67.46.129   KWIAT22\r\n129.67.46.255   caleb<\/pre>\n<p>plus some other irrelevant entries. 
Note that localhost.localdomain is commented out.<br \/>\n\/etc\/hosts on caleb contains:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">127.0.0.1       caleb\r\n#127.0.0.1      localhost.localdomain   localhost\r\n129.67.46.255   caleb\r\n129.67.46.129   KWIAT22<\/pre>\n<p>Note again, the localhost.localdomain entry has been commented out.<\/p>\n<p><strong>2. Java is required for inter-host communication.<\/strong> We will use Sun Java, as it is assumed to be most compatible with Sun GridEngine. Edit \/etc\/apt\/sources.list and uncomment the entries for the partner repository:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">deb http:\/\/archive.canonical.com\/ubuntu maverick partner\r\ndeb-src http:\/\/archive.canonical.com\/ubuntu maverick partner<\/pre>\n<p>Then install the JRE:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">apt-get install sun-java6-jre<\/pre>\n<p>Check which version of java we&#8217;ve got selected:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">root@caleb:~# java -version\r\njava version &quot;1.6.0_22&quot;\r\nOpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)\r\nOpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)<\/pre>\n<p>From that we can see that I still have OpenJDK selected, so we change that:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">root@caleb:~# update-alternatives --config java\r\nThere are 2 choices for the alternative java (providing \/usr\/bin\/java).\r\n\r\n  Selection    Path                                      Priority   Status\r\n------------------------------------------------------------\r\n* 0            \/usr\/lib\/jvm\/java-6-openjdk\/jre\/bin\/java   1061      auto mode\r\n  1            \/usr\/lib\/jvm\/java-6-openjdk\/jre\/bin\/java   1061      manual mode\r\n  2            \/usr\/lib\/jvm\/java-6-sun\/jre\/bin\/java       63        manual mode\r\n\r\nPress enter to keep the current 
choice&#x5B;*], or type selection number: 2\r\nupdate-alternatives: using \/usr\/lib\/jvm\/java-6-sun\/jre\/bin\/java to provide \/usr\/bin\/java (java) in manual mode.\r\nroot@caleb:~# java -version\r\njava version &quot;1.6.0_24&quot;\r\nJava(TM) SE Runtime Environment (build 1.6.0_24-b07)\r\nJava HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)<\/pre>\n<p>Now that we have these prerequisites satisfied, we can <strong>install the relevant gridengine packages<\/strong>. Installing gridengine on Ubuntu systems is made simple by the packages. We can install the packages on the master node (in our case KWIAT22):<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master<\/pre>\n<p>Configure SGE automatically? Yes<br \/>\nSGE cell name: default<br \/>\nSGE master hostname: KWIAT22 (this should be the fully qualified domain name of the SGE master, not localhost)<\/p>\n<p>Output will typically look something like this:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">Reading package lists... Done\r\nBuilding dependency tree\r\nReading state information... Done\r\nThe following extra packages will be installed:\r\n  gridengine-common\r\nThe following NEW packages will be installed:\r\n  gridengine-client gridengine-common gridengine-exec gridengine-master gridengine-qmon\r\n0 upgraded, 5 newly installed, 0 to remove and 37 not upgraded.\r\nNeed to get 0 B\/18.7 MB of archives.\r\nAfter this operation, 44.8 MB of additional disk space will be used.\r\nDo you want to continue &#x5B;Y\/n]?\r\nPreconfiguring packages ...\r\nSelecting previously deselected package gridengine-common.\r\n(Reading database ... 
372804 files and directories currently installed.)\r\nUnpacking gridengine-common (from ...\/gridengine-common_6.2u5-1ubuntu1_all.deb) ...\r\nSelecting previously deselected package gridengine-client.\r\nUnpacking gridengine-client (from ...\/gridengine-client_6.2u5-1ubuntu1_amd64.deb) ...\r\nSelecting previously deselected package gridengine-exec.\r\nUnpacking gridengine-exec (from ...\/gridengine-exec_6.2u5-1ubuntu1_amd64.deb) ...\r\nSelecting previously deselected package gridengine-master.\r\nUnpacking gridengine-master (from ...\/gridengine-master_6.2u5-1ubuntu1_amd64.deb) ...\r\nSelecting previously deselected package gridengine-qmon.\r\nUnpacking gridengine-qmon (from ...\/gridengine-qmon_6.2u5-1ubuntu1_amd64.deb) ...\r\nProcessing triggers for man-db ...\r\nProcessing triggers for ureadahead ...\r\nSetting up gridengine-common (6.2u5-1ubuntu1) ...\r\n\r\nCreating config file \/etc\/default\/gridengine with new version\r\nSetting up gridengine-client (6.2u5-1ubuntu1) ...\r\nSetting up gridengine-exec (6.2u5-1ubuntu1) ...\r\nerror: communication error for &quot;KWIAT22\/execd\/1&quot; running on port 6445: &quot;can't bind socket&quot;\r\nerror: commlib error: can't bind socket (no additional information available)\r\n..........................\r\ncritical error: abort qmaster registration due to communication errors\r\ndaemonize error: child exited before sending daemonize state\r\nSetting up gridengine-master (6.2u5-1ubuntu1) ...\r\nInitializing cluster with the following parameters:\r\n =&gt; SGE_ROOT: \/var\/lib\/gridengine\r\n =&gt; SGE_CELL: default\r\n =&gt; Spool directory: \/var\/spool\/gridengine\/spooldb\r\n =&gt; Initial manager user: sgeadmin\r\nInitializing spool (\/var\/spool\/gridengine\/spooldb)\r\nInitializing global configuration based on \/usr\/share\/gridengine\/default-configuration\r\nInitializing complexes based on \/usr\/share\/gridengine\/centry\r\nInitializing usersets based on \/usr\/share\/gridengine\/usersets\r\nAdding user 
sgeadmin as a manager\r\nCluster creation complete\r\nSetting up gridengine-qmon (6.2u5-1ubuntu1) ...<\/pre>\n<p>Note that the execd cannot bind the socket. This occurs when a left-over execd from a previous install failed to stop. It also happens if you don&#8217;t have java installed, as the execd won&#8217;t respond to \/etc\/init.d\/gridengine-exec stop without java. Also, if you&#8217;re doing an apt-get purge gridengine-* to get back to a fresh slate, typically the execd will not be stopped properly, despite being removed from the system. This can be fixed by:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">root@KWIAT22:~# ps aux |grep sge\r\nsgeadmin 22244 0.0 0.0 135172 4940 ? Sl 17:42 0:00 \/usr\/lib\/gridengine\/sge_qmaster\r\nsgeadmin 24272 0.0 0.0  58688 2500 ? Sl May16 0:22 \/usr\/lib\/gridengine\/sge_execd\r\nroot@KWIAT22:~# kill 24272\r\nroot@KWIAT22:~# \/etc\/init.d\/gridengine-exec start\r\nroot@KWIAT22:~# \/etc\/init.d\/gridengine-master restart\r\n * Restarting Sun Grid Engine Master Scheduler sge_qmaster<\/pre>\n<p>The logfiles we can use for tracking down problems in communication between the qmaster and execd processes are not in the standard Debian\/Ubuntu locations. Instead, they are stored in \/var\/spool\/gridengine\/qmaster\/messages for the qmaster and \/tmp\/execd_messages.[pid] or \/var\/spool\/gridengine\/execd\/messages for the execd processes. 
The log messages for our previous socket problem look like this (\/tmp\/execd_messages.24107):<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">05\/16\/2011 20:17:16|  main|KWIAT22|E|communication error for &quot;KWIAT22\/execd\/1&quot; running on port 6445: &quot;can't bind socket&quot;\r\n05\/16\/2011 20:17:17|  main|KWIAT22|E|commlib error: can't bind socket (no additional information available)\r\n05\/16\/2011 20:17:45|  main|KWIAT22|C|abort qmaster registration due to communication errors\r\n05\/16\/2011 20:17:47|  main|KWIAT22|W|daemonize error: child exited before sending daemonize state<\/pre>\n<p>If you see any lines containing |E| (error) or |C| (critical) then you have a problem that must be addressed. Any lines with |W| are warnings, and it&#8217;s probably wise to fix those too.<\/p>\n<p>On the exec nodes:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">apt-get install gridengine-exec<\/pre>\n<p>Configure SGE automatically? yes<br \/>\nSGE cell name: default<br \/>\nSGE master hostname: KWIAT22<\/p>\n<p>After installing, you will see the following error in the \/tmp\/execd_messages.[pid] file and the process will exit:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">05\/18\/2011 17:53:00|  main|caleb|E|getting configuration: denied: host &quot;caleb&quot; is neither submit nor admin host\r\n05\/18\/2011 17:53:05|  main|caleb|C|can't get configuration qmaster - terminating<\/pre>\n<p>This occurs because the master doesn&#8217;t yet know about the exec node. We need to set up a basic configuration on the master. 
We will use the documentation in \/usr\/share\/doc\/gridengine-common\/README.Debian, which I will duplicate here, to form the basis of our configuration:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">Once you've installed SGE, you'll need to do at least some minimal\r\ncluster configuration.\r\n\r\nQuickstart\r\n==========\r\n\r\n * Install gridengine-master, gridengine-exec and gridengine-client\r\n   on the appropriate hosts.\r\n\r\n * Initially, only the sgeadmin user has admin privileges\r\n\r\n * It is suggested that you add yourself as a manager and\r\n   perform the rest of these tasks as your own user:\r\n   + sudo -u sgeadmin qconf -am myuser\r\n\r\n * and to a userlist:\r\n   + qconf -au myuser users\r\n\r\n * Add a submission host:\r\n   + qconf -as myhost.mydomain\r\n\r\n * Add an execution host:\r\n   + qconf -ae\r\n   You will now be prompted for information about the execution host.\r\n\r\n * Add a new host group:\r\n   + qconf -ahgrp @allhosts\r\n\r\n * Add the exec host to the @allhosts list:\r\n   + qconf -aattr hostgroup hostlist myhost.mydomain @allhosts\r\n\r\n * Add a queue:\r\n   + qconf -aq main.q\r\n\r\n * Add the host group to the queue:\r\n   + qconf -aattr queue hostlist @allhosts main.q\r\n\r\n * Make sure there is a slot allocated to the execd:\r\n   + qconf -aattr queue slots &quot;&#x5B;myhost.mydomain=1]&quot; main.q\r\n\r\n * Running qstat -f should then show you the execd waiting for jobs<\/pre>\n<p>The commands that I ran in my example:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">sudo su\r\nsudo -u sgeadmin qconf -am rwh\r\nexit\r\nqconf -au rwh users\r\nqconf -as KWIAT22\r\nqconf -ahgrp @allhosts  # just save the file without modifying it\r\nqconf -aattr hostgroup hostlist KWIAT22 @allhosts\r\nqconf -aq main.q # just save the file without modifying it\r\nqconf -aattr queue hostlist @allhosts main.q\r\nqconf -aattr queue slots &quot;4, &#x5B;KWIAT22=3]&quot; main.q # 4 by default for 
all nodes, 3 specifically for KWIAT22, which leaves 1 of the 4 cpus free for the master process<\/pre>\n<p>We then add caleb as a submit and exec host:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">qconf -as caleb\r\nqconf -ae # change the hostname entry to caleb\r\nqconf -aattr hostgroup hostlist caleb @allhosts<\/pre>\n<p>Once this is done, we need to start the execd on caleb:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\/etc\/init.d\/gridengine-exec start<\/pre>\n<p>Check that it doesn&#8217;t create a log file in \/tmp\/execd_messages.[pid]. If it doesn&#8217;t then it&#8217;s happy! Back on our master node, a qstat -f should now show us all set up. You can use the GUI qmon tool to get a better look at the setup. To use qmon, you must ssh to the master node with X11 forwarding enabled:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">ssh -X hostname\r\nqmon<\/pre>\n<p>Click the queue control button and then the Hosts tab. If the exec nodes are communicating properly with the master, you should see them listed there, and they should NOT have dashes for the information columns. If a node does show dashes, it&#8217;s not communicating correctly, and you&#8217;ll need to go look in the log files for the reason. 
Note that if java is not installed, the communication between nodes will not work, and this may or may not show up in the log files.<\/p>\n<div id=\"attachment_174\" style=\"width: 744px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-fail-communication.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-174\" class=\"size-full wp-image-174\" title=\"qmon failing to communicate with the exec node\" src=\"http:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-fail-communication.png\" alt=\"\" width=\"734\" height=\"216\" srcset=\"https:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-fail-communication.png 734w, https:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-fail-communication-300x88.png 300w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/a><p id=\"caption-attachment-174\" class=\"wp-caption-text\">Failure to communicate with the exec node, caleb<\/p><\/div>\n<div id=\"attachment_175\" style=\"width: 744px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-success-communication.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-175\" class=\"size-full wp-image-175\" title=\"qmon successful communication\" src=\"http:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-success-communication.png\" alt=\"\" width=\"734\" height=\"216\" srcset=\"https:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-success-communication.png 734w, https:\/\/helms-deep.net\/~rwh\/blog\/wp-content\/uploads\/2011\/06\/qmon-success-communication-300x88.png 300w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/a><p id=\"caption-attachment-175\" class=\"wp-caption-text\">    Success communicating with the exec node, caleb, after installing and selecting Sun Java.<\/p><\/div>\n<p>Next, 
we need to set up a <strong>parallel environment<\/strong>. This will allow gridengine to start processes on the remote exec nodes. We can do this with qmon, though it&#8217;s also possible with the CLI tools. In qmon, click the bottom left button, <strong>Parallel Environment Configuration<\/strong>. Click <strong>Add<\/strong>. In our example, we&#8217;re setting up the simplest form of parallel environment, which doesn&#8217;t include any message passing functionality. Set the Name to <strong>simple_pe<\/strong>. In our case, we have two 4-core machines, with one core reserved for the master process, so we set the slots to 7. The rest we leave as default values; just click <strong>OK<\/strong>, then click <strong>Done<\/strong>. Now click the second button from the left in the top row, <strong>Cluster Queues<\/strong>. On the Cluster Queues tab, click main.q, then click Modify. Click the parallel environment tab, then click simple_pe and move it over to the referenced PEs box. Click OK, and Done.<\/p>\n<p>Lastly, you need to <strong>set up passwordless ssh access<\/strong> from the master node to the exec nodes for the users of the gridengine system. This is left as an exercise for the reader, but you might start with learning about <a title=\"OpenSSH key management\" href=\"http:\/\/www.ibm.com\/developerworks\/library\/l-keyc\/index.html\" target=\"_blank\">OpenSSH key management<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve had Sun GridEngine running on our cluster of 12-core HP blades from its earliest days. What has not been working is the inter-host communication (the ability of the system to schedule and distribute jobs across the nodes). 
I &hellip; <a href=\"https:\/\/helms-deep.net\/~rwh\/blog\/?p=159\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5],"tags":[47,39,46,45,42,43,37,48,40,41,49,50,38,36,44],"class_list":["post-159","post","type-post","status-publish","format-standard","hentry","category-howtos","tag-apt","tag-cluster","tag-deb","tag-debian","tag-environment","tag-grid","tag-gridengine","tag-maverick","tag-parallel","tag-parallelenvironment","tag-qconf","tag-qmon","tag-sge","tag-sun","tag-ubuntu"],"_links":{"self":[{"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/posts\/159","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=159"}],"version-history":[{"count":14,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/posts\/159\/revisions"}],"predecessor-version":[{"id":167,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=\/wp\/v2\/posts\/159\/revisions\/167"}],"wp:attachment":[{"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=159"},{"taxonomy":"post_
tag","embeddable":true,"href":"https:\/\/helms-deep.net\/~rwh\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}