Xgrid Leopard: the good, the bad, the ugly, and the new stuff

With Leopard came Xgrid 2, with much improved performance and a few new features that many Xgrid users will probably find useful. In this article, I will try to go through all the new things in Xgrid, at the controller, agent and client levels. You might also be interested in the companion article about Scoreboard.

Documentation: the good, the bad and the ugly

I will start where it hurts. While Xgrid is certainly THE simple solution for distributed computing, a little documentation cannot hurt. Unfortunately, with Xgrid, while there is some good, there is also some bad and some ugly.

Apple provides a very detailed guide on Xgrid administration using Mac OS X.5 Server. This guide existed with Mac OS X.4, and remains a must-read. The guide has been updated to include instructions and details about the new features on the controller side of Xgrid. This is good.

Where things get shadier is in the coverage of the client side of Xgrid, which is the part where you submit jobs. There are two ways to submit jobs: using the command-line tool xgrid, or using the Cocoa APIs provided by XgridFoundation.framework. Both can use a special plist-formatted file for job submission, also known as a job specification. Between the xgrid man page, the XgridFoundation headers and the developer examples, one was able to figure out how to build job specifications and submit jobs in Tiger, but not without some hard work and lots of debugging. This was quite bad.
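As an illustration of what such a job specification can look like, here is a minimal sketch of a batch file with one job and one task, written out by a small shell script. The key names (name, taskSpecifications, command, arguments) and the '-job batch' option are given from memory of the Tiger-era tools and should be checked against the xgrid man page and the XgridFoundation headers. The file name, job name, hostname and password are placeholders. The script only writes the file and prints the submission command; it does not contact any controller.

```shell
#!/bin/sh
# Write a minimal batch file: one job containing a single task.
# Assumption: key names follow the Tiger-era batch format; verify them
# against the xgrid man page before relying on this.
SPEC="${TMPDIR:-/tmp}/xgrid-example-job.plist"

cat > "$SPEC" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<array>
    <dict>
        <key>name</key>
        <string>Example job</string>
        <key>taskSpecifications</key>
        <dict>
            <key>0</key>
            <dict>
                <key>command</key>
                <string>/bin/echo</string>
                <key>arguments</key>
                <array>
                    <string>bonjour</string>
                </array>
            </dict>
        </dict>
    </dict>
</array>
</plist>
EOF

# your_controller and your_password are placeholders, not real values.
echo "xgrid -h your_controller -p your_password -job batch $SPEC"
```

Keeping the specification in a file makes it easy to add more tasks under taskSpecifications, each with its own numeric key, without changing the submission command.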

With Leopard, things got ugly... The Leopard Server press release delivered a very enticing message about 'Xgrid 2 featuring GridAnywhere for ad hoc distributed computing in environments without dedicated controllers, and Scoreboard for prioritizing job distribution to the fastest available CPU'. The Xgrid home page talks again about Scoreboard... but makes no mention of GridAnywhere. When I googled 'GridAnywhere' or 'Scoreboard Xgrid', all I got was a few hundred hits rehashing the Leopard Server press release. I also checked the Developer examples, the XgridFoundation headers and the developer documentation. The only information was that you can call 'defaultController' and 'privateController' on the XGController class, and that there is a series of keys that are new to Leopard:

XGJobSpecificationARTConditionsKey
XGJobSpecificationARTDataKey
XGJobSpecificationARTEqualKey
XGJobSpecificationARTMaximumKey
XGJobSpecificationARTMinimumKey
XGJobSpecificationARTSpecificationsKey
XGJobSpecificationSchedulerHintsKey

ART? WTF? The mystery was getting thicker. How about a look at the new xgrid man page? It is the same as in Tiger, except for the addition of a few new options in the job submission arguments (shown below), but with no explanation whatsoever of what they do!


xgrid -job submit [-gid grid-identifier] [-si stdin] [-in indir]
[-dids jobid [, jobid]*] [-email email-address]
[-art art-path -artid art-identifier] [-artequal art-value]
[-artmin art-value] [-artmax art-value]
cmd [arg1 [...]]

Mystery 1 = GridAnywhere, Mystery 2 = Scoreboard, Mystery 3 = ART, Mystery 4 = default and private controllers. To me, this looked more and more like an episode of Lost. There are apparently new useful Xgrid features in Leopard, but they are not documented anywhere...

While this would seem to put an end to this article, these features are fortunately explained in more detail in the video of the Xgrid session of WWDC 2006, accessible to WWDC attendees and to ADC Select or Premium members. It might even be accessible to free ADC account members, since this is from the "old" WWDC 2006, and Leopard is out. If you have access to it, watch it: the talk is good and useful. With some testing and extrapolation, it was then easier to figure out what all these things mean. So, without further whining, let's get started.

The Xgrid Controller in Leopard

Performance. Here at MacResearch, we use Xgrid for the OpenMacGrid project. One of the first things we noticed with the Xgrid controller in Leopard was a very obvious boost in performance. With Xgrid 1.0 on Tiger, things were very stable, but we had occasional crashes of the controller. We also had to occasionally clean up the controller database and remove all the cruft that seemed to bring the controller to a complete stop after too many jobs. We had also noticed that the controller would often get really busy, which could make the connections very slow, for instance with Xgrid Admin. All of this is gone with Xgrid 2.0 and Leopard: no more crashes, no more lockups, no more slow connections. The Apple engineer(s) working on Xgrid did a really good job at improving performance and stability.

Backward-Compatibility. Importantly, the Xgrid 2.0 controller remains compatible with agents running Xgrid 1.0, so you can keep using agents running Mac OS X.3 Panther and Mac OS X.4 Tiger. To some extent, Xgrid Leopard is even more backwards-compatible than Xgrid Tiger, thanks to Scoreboard (which I will cover later in this article).

Shared Filesystem. Many of you run the Xgrid Controller using Mac OS X Server, which adds a nice user interface to the setup, and allows you to integrate Xgrid with other OS X services, in particular Kerberos authentication and a shared filesystem. Leopard Server comes with a new configuration assistant that makes it even easier than before to set up Xgrid using Server Admin. In particular, it appears to be much easier to make Xgrid work securely with a shared filesystem (also administered by your OS X Server box). Basically, you get a shared filesystem mounted on the agents, right there in /Network/Xgrid. A lot of these features are explained in more detail in the Xgrid administration guide. I do not have a setup with a dedicated cluster of agents and a shared filesystem in place to explore all of these possibilities, or even to confirm or test them. But if you do, I would really encourage you to take advantage of the easy and secure integration offered by OS X Server: use a shared filesystem, since one major limitation of Xgrid is in moving files around. A quick warning, though: Leopard has added more layers of security, and this may make it harder to get things working the way they did in the past. For instance, the Leopard sandbox feature can prevent the Xgrid agent from accessing an NFS share.

Command-line Administration. As before, the controller can also be started/stopped from the command-line, using the xgridctl command. Not much has changed there, except that the "on" and "off" options are deprecated, which is a welcome change. This means the command 'xgridctl c start' will not only start the controller, but also make it stick the next time you start the computer (and the controller will restart if the process crashes). I also noticed that the output of the command "xgridctl status" has changed. In Tiger, when the controller is stopped, the command returns no information at all, just the prompt. With Leopard, the status of the controller and of the agent daemon is always shown, making it clear that your controller is indeed stopped. Nice touch.

Xgrid on the Client. The command xgridctl is also still available on the client version of Mac OS X, which means you can still set up your home-grown Xgrid cluster in a garage without buying the OS X Server version, just as easily as it could be done in Tiger. You may also still use the free and open-source Xgrid Lite preference pane to set up a password and control the Xgrid service using a graphical user interface.

Xgrid Admin. Another piece of software associated with Xgrid administration is Xgrid Admin, which sees a small update and is still packaged with the Server Admin Tools (note: the URL for the Tiger Server Admin Tools has changed). This is a free download that allows you to manage agents and jobs, even remotely using a non-server OS X Leopard machine. The Xgrid Admin for Leopard looks very much the same as the Xgrid Admin for Tiger. No cover flow, no fancy animations, no flying windows. Just one great improvement: you can now select several jobs or several agents AT THE SAME TIME!! This is very convenient when you want to delete more than one job at a time... Unfortunately, the search field is still case-sensitive, which means you really need to type "Steve" and not "steve" when looking for Steve's computer.

Podcast Producer. Finally, it is nice to see Apple releasing for the very first time an application that takes advantage of Xgrid: Podcast Producer. It comes bundled with Mac OS X Server, and "simplifies the process of recording content, encoding, and publishing podcasts for playback in iTunes and on iPod, iPhone, and Apple TV". Podcast Producer is able to use Xgrid for large-scale podcast productions. Encoding tasks are then automatically distributed to your agents. It requires a shared file system such as Xsan or NFS, which is necessary to move those big video files efficiently. Podcast Producer is not going to find a cure for cancer, but at least it is Xgrid-aware.

The Xgrid Agent in Leopard

In Xgrid, the agent is the computer running the task that the client needs, and that the controller schedules. Any computer running Mac OS X.3, X.4 or X.5 can become an agent. Like in Tiger, the agent for Leopard can be set up using the Sharing preference pane. This pane has seen some important changes in the user interface, and Xgrid benefits from them. The Authentication Method is directly accessible, while the other settings are still accessible only after clicking the Configure button.

Just like the controller, the agent daemon can be set up using the xgridctl command-line tool, with the 'xgridctl a start' or 'xgridctl a stop' commands.
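The controller and agent commands can be wrapped in one small helper. This is only a sketch, assuming a POSIX shell: the function name xgrid_service and the DRY_RUN variable are made up for illustration, and the use of sudo is an assumption based on how readers typically invoke xgridctl. Only xgridctl, its 'c' and 'a' selectors and the start/stop verbs come from the text. By default the helper just prints the command it would run; set DRY_RUN=0 to actually execute it.

```shell
#!/bin/sh
# Hypothetical helper: start or stop the Xgrid controller or agent daemon.
# Prints the command only, unless DRY_RUN=0 is set in the environment.
xgrid_service() {
    case "$1" in
        controller) daemon=c ;;
        agent)      daemon=a ;;
        *) echo "usage: xgrid_service controller|agent start|stop" >&2
           return 2 ;;
    esac
    case "$2" in
        start|stop) ;;
        *) echo "usage: xgrid_service controller|agent start|stop" >&2
           return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sudo xgridctl $daemon $2"
    else
        sudo xgridctl "$daemon" "$2"
    fi
}

xgrid_service controller start
xgrid_service agent stop
```

Printing first is a cheap way to check what will be run on a machine where the daemons are already serving jobs.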

The Xgrid Client in Leopard

Before I go into the details, I want to start by restating what constitutes the "Client" part of Xgrid. The Client is the part that submits the jobs, which is also the part you will spend more time working in, as a scientist (you might spend some time initially on the controller and agent sides, but hopefully, once this is figured out, there should be minimal work to be done on these). On the client side, Apple offers two tools:

  • The xgrid command-line tool, that you can for instance integrate into a non-graphical scripted workflow
  • The Cocoa APIs, that you can use to build custom programs, either command-line tools, or full-fledged OS X applications with a nice graphical user interface (and even some flying windows and coverflow animations)

As I mentioned in the first section talking about Xgrid documentation, there are several hints in both tools regarding some of the new features available to the client. I will cover Scoreboard separately in the next section. But first, let's talk about GridAnywhere.... GridAnywhere has 2 sides to it: "Xgrid Here" and "Xgrid There", a.k.a. "private controller" and "default controller". These 2 features are really Leopard-specific and Client-specific, which means they require Mac OS X Leopard on the client side, but do not require the controller to run Leopard.

Private Controller.
You might be wondering: how can a private controller be a feature of the Xgrid client?? Well, in fact, this is exactly the idea: you can get a controller, even if you don't have any! Instead of accessing a normal Xgrid Controller using the hostname of the machine running it, you can use the special hostname ':private:' to access a temporary controller and agent, that will act just like a regular controller, but is not connected via the network. The jobs you submit to this controller are run using your computer as the agent, and there is no way to have more machines attached to it, since it is not connected to the network.

How do you use that feature? It is very easy. Take any existing client application, for instance Xgrid Admin or GridStuffer (please upgrade to GridStuffer 0.4.7 before using GridStuffer on Leopard). In the field where you would normally type the hostname, type instead ':private:' and you get a grid with 1 machine (yours), ready to accept and run jobs. This works without writing any new code, even if the app was built in Tiger and has not been modified: for instance, for GridStuffer, I did not have to do anything special, it was just there!

Why is that useful? I can see at least 2 reasons why you would want such a feature. First, for the developer of an Xgrid client application: you can test Xgrid functionality without having access to an Xgrid controller, or without firing up an Xgrid controller on your local machine. Second, for both the developer and the user: typically, an application may want to provide a way to distribute tasks to the cores of your local machine, and at the same time offer the possibility to distribute them to a cluster of machines, via Xgrid. With Tiger, the developer would have to code for the 2 different cases. With GridAnywhere, it can all be handled by Xgrid. For the user, that means the application can be Xgrid-aware, and take advantage of some of Xgrid's features, but without requiring an Xgrid controller. This is a great way to keep those cores busy, and makes for a good intermediate solution between NSOperation and a full-blown Xgrid setup.

The user of the application does not even have to type the somewhat cryptic ":private:" string in a text field. Access to the private controller is also directly provided with the '-privateController' method in the Cocoa APIs (see the XGController header).

There are some pitfalls and limitations to be aware of when using this private controller. It is very important to understand that this is not the same as having a controller running locally on your machine (e.g. as you would have by calling 'xgridctl c start'). The private controller will only have access to one agent, and that is the machine on which the Xgrid client application is running. The private controller only lives in RAM, with no persistent store on your hard drive, and will only exist for the duration of the Xgrid session. If you quit the application, or disconnect the controller while it is running jobs, all will be lost. In addition, it is important to realize that each application will start its own little private controller, and they won't be connected to each other. For instance, if you submit some jobs on the private controller in GridStuffer, you will not see these jobs in Xgrid Admin (in fact, accessing the private controller in Xgrid Admin is pretty much useless).

The same is true of the xgrid command-line tool. You can use the ':private:' string for the hostname. But if you use the 'submit' option for the job submission, the tool will submit, then quit, and the job is basically lost. If you use the 'run' option for the job submission, it will work mostly as expected. Here is for example what your Xgrid sessions might look like in the Terminal:

>xgrid -h :private: -job run /bin/echo 'bonjour'
bonjour
>xgrid -h :private: -job submit /bin/echo 'bonjour'
{
    jobIdentifier = 0;
}
>xgrid -h :private: -job attributes -id 0
{
    error = InvalidJobIdentifier;
}
>xgrid -h :private: -job list
{
    jobList =     (
    );
}
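Since only the 'run' option is practical with the private controller, a script that wants to use the local machine this way has to run its commands one after the other. The following is a minimal sketch of that idea, assuming a POSIX shell. The function name run_private and the DRY_RUN variable are hypothetical; the xgrid options are the ones used in the session above. By default the helper only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Hypothetical helper: run one command per argument on the private controller.
# Uses "-job run" (blocking) because a job sent with "-job submit" is lost
# as soon as the xgrid tool quits. Prints commands only, unless DRY_RUN=0.
run_private() {
    cmd=$1
    shift
    for arg in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "xgrid -h :private: -job run $cmd $arg"
        else
            xgrid -h :private: -job run "$cmd" "$arg"
        fi
    done
}

run_private /bin/echo bonjour hello
```

Each call blocks until its job is done, so this does not spread work across cores by itself; it mainly shows how the ':private:' hostname fits into a scripted workflow.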

Default Controller.
In Mac OS X Leopard, a hidden user default is also available to set up a "default controller" (which falls back to the private controller if no default controller has been defined). At this point, this default controller is only available through the Cocoa APIs, and requires adding code to take advantage of it. To access the default controller, you must call the method -defaultController in the XGController APIs.

It seems this feature might be aimed, for instance, at large grids with several clients using the cluster for their computation. Once implemented in such a workflow, changes in the default controller could be pushed to the client machines as needed, using the command-line (there is no GUI for it). The defaultControllerName key is used to define the name of the controller that should become the default controller. The resolveNameAsNetService key should be YES if the defaultControllerName corresponds to a local controller advertising its service via Bonjour, and NO for a remote host. Both keys should be set in the com.apple.xgrid.foundation domain:

defaults write com.apple.xgrid.foundation defaultControllerName openmacgrid.macresearch.org
defaults write com.apple.xgrid.foundation resolveNameAsNetService -bool NO
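To check what is currently set, or to undo the change, the standard 'read' and 'delete' verbs of the defaults tool can be used on the same domain. This is a sketch only: the helper name xgrid_defaults and the DRY_RUN variable are hypothetical, and the idea that deleting the key brings back the private controller is an assumption drawn from the fallback behaviour described above. By default the helper prints the commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Domain and key names are the ones given in the article.
DOMAIN=com.apple.xgrid.foundation

# Hypothetical helper: print (or run, with DRY_RUN=0) a defaults command.
xgrid_defaults() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "defaults $*"
    else
        defaults "$@"
    fi
}

# Inspect the current settings.
xgrid_defaults read "$DOMAIN" defaultControllerName
xgrid_defaults read "$DOMAIN" resolveNameAsNetService

# Remove the default controller name (assumed to restore the fallback
# to the private controller).
xgrid_defaults delete "$DOMAIN" defaultControllerName
```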

Conclusion

While Xgrid documentation is still lagging behind, or even simply missing, Apple has done a very good job at improving performance and adding a few useful features. One of the most interesting features is covered in a companion article: Scoreboard rules! Hopefully, with these MacResearch articles on Xgrid Leopard, the lack of documentation will become less of an issue, and Xgrid 2.0 will be put to good use before the next version of Mac OS X.

Comments

ERROR: timeout starting xgridcontrollerd

When trying to start xgridctl on my machine, using this command "sudo xgridctl c start" (basic version of Leopard, 10.5.4), I get this error message. Has anyone else encountered this in Leopard? If so, is there a fix?

Thanks,

Ryan

Starting Xgrid controller on Leopard client

Ryan,
I had a hack a while back & got the Xgrid controller going on Leopard client (in password auth mode),
here's a summary for you.

1) Fire up System Preferences and configure your Xgrid sharing prefs.
Set controller to "localhost", authentication method as "password" and set a password of your choice. This will set /etc/xgrid/agent/controller-password (password the agent uses to talk to controller) to be a hash of the password you entered. Thankfully Apple use the same key to hash all the passwords, which makes the job easier ;p

2) Set the client password (password clients use to talk to controller) by copying the controller password.
sudo cp /etc/xgrid/agent/controller-password /etc/xgrid/controller/client-password

3) Set the agent password (password the controller uses to talk to agent) by copying controller password.
sudo cp /etc/xgrid/agent/controller-password /etc/xgrid/controller/agent-password

4) Start the controller
sudo /usr/libexec/xgrid/xgridcontrollerd
The terminal will spit out a whole load of warnings, telling you it's setting up databases etc.. this is normal & indicates the grid is starting up.

Eventually you'll see something like:
Wed May 28 20:19:35 Issaquah.local xgridcontrollerd[333] : Notice: controller accepted agent connection from "127.0.0.1" port "49221" (sid = 0x21dd40)
Wed May 28 20:19:36 Issaquah.local xgridcontrollerd[333] : Notice: controller agent "Issaquah" state changed to "Available"

The controller should now be live (you can view using Xgrid Admin).
Fire off a "cal" job to test
xgrid -h 127.0.0.1 -p your_password -job submit /usr/bin/cal
Note: -h = host. It didn't seem to accept "localhost" for me, but it should accept IP addresses, bonjour addresses (eg. machinename.local), and FQDNs.

You should see this running in Xgrid Admin (it's quick so won't last long, but it'll be in the job queue).

The terminal should now have something similar to the following:
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller job "16" created
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller created task "0" for job "16"
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller grid "0" submitted task "0" for job "16" to agent "Issaquah"
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller connection closed (sid = 0x21b0c0)
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller task "0" for job "16" state changed to "Running"
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller job "16" state changed to "Running"
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller task "0" for job "16" state changed to "Finished"
Wed May 28 20:26:51 Issaquah.local xgridcontrollerd[333] : Notice: controller job "16" state changed to "Finished"

If so then your controller is running successfully and you should be able to close the terminal window.

hopefully that helps,
thanks,
mark

[solved] sudo xgridctl status

Hello

On a 10.5.6 machine (not Server), after typing "sudo xgridctl status", I get an error message which I couldn't "google":

2009-01-10 21:00:58.333 ruby[2181:c0b] RBCocoaInstallRubyThreadSchedulerHooks: couldn't find autoreleasePool ivar
daemon state pid
====== ===== ===
/Library/Frameworks/RubyCocoa.framework/Resources/ruby/osx/objc/oc_import.rb:156:in `const_missing': uninitialized constant LAUNCH_DATA_DICTIONARY (NameError)
from /usr/sbin/xgridctl:10:in `pid_for_service'
from /usr/sbin/xgridctl:71:in `print_service_status_row'
from /usr/sbin/xgridctl:131:in `print_status'
from /usr/sbin/xgridctl:187:in `run'
from /usr/sbin/xgridctl:219

What could this mean and what could be done to get it running?

Thanks!
dr3do

solution: download and install http://sourceforge.net/projects/rubycocoa/

Authentication

Thanks for the great article. And also thanks for the comment on how to set up authentication. As a try-out, I'm trying to set up a little cluster using the 6 cores of 2 iMacs and a macbook pro, in a mixed Leopard and Tiger environment.
I've been trying all the advice in this article, the comment, xgrid@stanford, etc., but the combo of sudo command-line start-up of the Xgrid controller with copying of password files + GridStuffer GUI for job submission + Xgrid Admin for viewing the progress and number of agents seems a little unstable. At one point, all these elements were working together and executing one job (on one node), but then I think the controller exited, and on restarting in the same way the database was closed improperly and corrupted, etc.

In short: it seems from these wonderful articles that xgrid could work wonders even on Leopard machines; but one year after these articles, there still is no decent documentation to be found on the web to set up even the most basic xgrid network. [by which I mean: no xServer, but say a couple of intel imacs and macbooks; with security (password or kerberos)]. All the more elaborate articles are from Tiger days, and can not really be extrapolated to Leopard.

I'm a scientist and life-long Mac user and would love to use Xgrid to expand my computational power, and I'm sure there are currently more and more scientists using Macs for their work, so I was hoping that maybe you guys could do a little brush-up of an Xgrid tutorial. A 2009 step-by-step instruction to set up an Xgrid using a couple of owned Macs (i.e. access, but not necessarily on local network), securely. In my particular case, the code itself is not parallelized, but I need to run many instances of the same code with different parameters. Instances that consume plenty of CPU and RAM. It would be a huge boost if I could use other idle Macs I have at my disposal to aid in these computations.

Any suggestions to get this to work would be greatly appreciated.
Thanks ,
--J

Leopard Problems

Hi J,

First of all, you have to realize that running an Xgrid server for "production" outside of OS X Server is not supported by Apple. The only supported use is for development purposes.

That said, it should work just the same as in Tiger. The database corruption problem you seem to have has not happened for us on OpenMacGrid ever since we switched to the Leopard Server. But then, we are running it on a dedicated machine running OS X Server. With your relatively small setup, it is still surprising that you have so many problems. Cleaning up the database is very straightforward, and I would encourage you to do so.

In addition, XgridLite is being updated to Leopard as we speak, check Ed Baskerville's site for the update. We will also post it on MacResearch when it comes out. It can be a good solution for the setup you describe.

Finally, I agree an updated tutorial could be nice. I will think about it :-)

charles

firewall

Hi,

Thanks for your response. I did finally get it to work, largely due to the help of Ed Baskerville, who has just updated XgridLite to support Leopard. This way I could very easily reset the controller and set the passwords for client and agents. Very useful.
One last question you might be able to help me with: the computers I got the xgrid to run on are at my office behind a firewall that I don't have access to. However, I would like to add one (or more) mac agents to the cluster on remote networks that I do have ssh access to. Would you know how to set up a ssh tunnel for the 4111 ports so that those agents can find my controller? [I found some suggestions googling, but all of them went the wrong way around, i.e. ssh-ing from the agent to the controller, which is not possible for me, as the 22 port is not forwarded to the controller computer].
I thought this should work: ssh -L 4111:remote.agent.org:4111 -l meuser -N remote.agent.org but it doesn't (bind: Address already in use
channel_setup_fwd_listener: cannot listen to port: 4111
Could not request local forwarding. )
maybe because the computer that runs the controller also runs 2 agents on itself.

Thanks again for your useful article.
--J