Xgrid Leopard: Scoreboard rules!

With Leopard came Xgrid 2, with much improved performance and a few new features that many Xgrid users will probably find useful. In this overview of Xgrid Leopard, I listed all the new things in Xgrid, at the controller, agent and client levels. In this article, I will focus on one particularly interesting feature: Scoreboard.

Scoreboard deserves its own section, because it is probably the most useful feature in Xgrid 2. Scoreboard and the mysterious "ART" acronym are actually the same feature. ART stands for "Agent Ranking Tool". In short, an ART is an executable (or a script) that runs on the agents and returns a score. The Controller keeps track of all the scores, one for each agent, writes them down on its virtual Scoreboard, looks at the numbers, scratches its long white beard, and then decides who gets the job.

Why Scoreboard?

With Xgrid 1.0, when you submitted a job, you could not make any assumption about the agent(s) that would run it. Xgrid sure made it easy for you, but maybe too easy: it pretended you did not need to know or care about the machine running your computations. Unfortunately, while it may be true that you don't care which exact machine runs them, you may care about its specs. For instance, some computations may need a lot of RAM, and you don't want to waste any cycles on a poor old 256 MB G3. Or your code might only run on Intel processors. Or you might have special requirements for the GPU. One way to work around that in Xgrid 1.0 was to have some tests run as part of the job, and bail out if the requirements were not met. But then, (1) the job would be considered 'failed', which means you needed to handle those failures gracefully at the client level, and (2) the agent would instantly be 'Available' again, which means it would be instantly rescheduled for the job it just bailed out of, making this agent an evil Xgrid black hole, sucking in all the jobs getting too close to it! Workarounds were possible, but they would require another tutorial... which I don't need to write anymore, since Scoreboard is here!

Xgrid already keeps track of the processor speed. So maybe it could also keep track of the RAM, the GPU, the type of processor, the available HD space, etc... The problem is that this list can grow pretty fast, and who knows what the future will bring: maybe you will want to check for the presence of a flash drive, a GPS, or a QPU, whenever such things get added. Rather than second-guessing user needs or squinting at a crystal ball, the Xgrid developers came up with a simple, flexible and powerful solution. You, the user, write the script or program that will run on each agent, and that will return whatever score *you* decide.

For instance, you could decide to have a score based on the amount of RAM, by using this script:

#! /usr/bin/perl

# profiler values are in the form: '    __key__: _____value____' (one per line)
my $memory = `/usr/sbin/system_profiler SPHardwareDataType | grep Memory`;
my $regex = '^\s*Memory: (\d*) (.*)$';
my ( $value, $unit ) = ( $memory =~ /$regex/m );

# print RAM value in megabytes
if ( $unit eq 'MB') { print $value }
if ( $unit eq 'GB') { print $value * 1024 }

We just wrote our first Agent Ranking Tool. Tada!
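
On an agent where system_profiler reports "Memory: 2 GB", this script will print 2048; on an older machine with "Memory: 512 MB", it will print 512.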

Scoring and filtering

Now, here is how Scoreboard works. Together with the job description, you, the client, submit an "ART", for instance the script above. The Controller then does the following:

  • Scoring: the controller submits the ART to all the available agents, and receives the scores back for each agent.
  • Ranking: agents with a score of 0 are eliminated, and will not run any of the tasks for that job. The remaining agents are ranked based on their score.
  • Working: the tasks for the job are sent to the selected agents. The agents with the highest scores are used in priority.
  • Results: the client simply receives the results back, and does not have to care about the scoring.

    There are a few more subtleties in the selection and in the ranking of agents.

    First, you can submit several ARTs. The Controller will submit all the different ARTs to all the agents, and will calculate a final score, which is simply the product of the individual scores. This allows more complex ranking. It also means that a score of 0 in any of the ARTs will result in a final score of 0, no matter what the scores for the other ARTs are. An ART that returns only 0 or 1 is thus a very simple way to separate agents into two groups. For instance, a first ART might return the memory, and a second ART might return 1 if the processor is Intel, and 0 if it is PPC. The final score would thus be 1024 for an Intel machine with 1 GB of memory, and 0 for a PowerMac G5 with 4 GB.
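
    Such an Intel-or-not ART could be as short as the sketch below, assuming that uname -p reports "i386" on Intel Macs and "powerpc" on PPC machines (combine it with the memory script above, and the product of the two scores gives the final score):

    #! /usr/bin/perl

    # hypothetical filtering ART: print 1 for Intel processors, 0 for PowerPC
    my $arch = `/usr/bin/uname -p`;
    chomp $arch;
    print( $arch =~ /^i\d86$/ ? 1 : 0 );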

    The second tool that comes with ART is "Conditions". In what we have seen so far, the only way to eliminate an agent is to assign it a score of 0. That means your ART script has to be written so that it returns 0 under certain conditions. If those conditions were slightly different for another job, you would need to rewrite the script. Conveniently, Scoreboard provides an alternative, with a few arguments that set selection conditions beyond the elimination of zero-scored agents. If you look again at the xgrid command-line syntax, you will see that you can instruct the Controller to select agents with a score in a specified range (min, max or equal):


    xgrid -job submit [-gid grid-identifier] [-si stdin] [-in indir]
    [-dids jobid [, jobid]*] [-email email-address]
    [-art art-path -artid art-identifier] [-artequal art-value]
    [-artmin art-value] [-artmax art-value]
    cmd [arg1 [...]]

    The ability to change the conditions independently of the ART script itself means that you can reuse your scoring scripts. In fact, it allows you to build a library of ART scripts that you may use for different jobs, only changing the ART conditions as needed for your different computations, and let Scoreboard do the rest.
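
    For example, the memory script from the beginning of this article could be reused as-is to keep only agents with at least 2 GB of RAM, with something like this (the file names memory_art.pl and my_command.sh are just placeholders):

    xgrid -job submit -art memory_art.pl -artid memory -artmin 2048 my_command.sh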

    Scoreboard specifications

    Let's now see how to include ART arguments in the job specifications. For testing purposes, I have written two useless scripts that will be our "ART" script and our "task" script. Both scripts take the computer name, for instance "Steve's MacBook", and do the following:

    * The ART script returns a number that ranks the agent based on the alphabetical order of its name (using the first two letters)
    * The task script prints the first five letters of the name, then computes the ART score again and prints it too

    Thus, we expect the tasks to be assigned to the different agents based on the position of their names in the alphabet (I warned you, these scripts are useless). Agents starting with "z..." will get the tasks first, agents starting with "A..." last (note that I am actually using ASCIIbetical order here).
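
    As a rough sketch, the scoring part of such an ART could look like the following, assuming the computer name is obtained with scutil and using 256 times the first character plus the second character as the score (which is consistent with the numbers reported below):

    #! /usr/bin/perl

    # name-scoring sketch: names later in ASCIIbetical order get higher scores
    my $name = `/usr/sbin/scutil --get ComputerName`;
    chomp $name;
    my @chars = split //, $name;
    print ord( $chars[0] ) * 256 + ord( $chars[1] );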

    The first possibility is to use the command-line format, assuming you have both scripts in the current directory, and that you have defined the controller and password using the appropriate environment variables:

    xgrid -job submit -art omg_art_test.pl -artid name_score omg_task_test.pl
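
    (The environment variables in question are XGRID_CONTROLLER_HOSTNAME and XGRID_CONTROLLER_PASSWORD, which the xgrid tool reads to locate and authenticate with the controller.)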
    

    This will only submit one task, which will only run on one machine. A more powerful alternative is to use the batch submission format instead, which should include the following new keys as part of the job specifications:

    artConditions = {
        identifierART1 = {
            artEqual = xxxx;
            artMin   = xxxx;
            artMax   = xxxx;
        };
        identifierART2 = {  };
        ...
    };
    artSpecifications = {
        identifierART1 = {
            artData = <2321202f ...>;
        };
        identifierART2 =  { ... };
        ...
    };
    

    You may also want to have a look at a working example that I am using here for testing. It seems it is important to have both the artSpecifications and the artConditions, even if you do not apply any conditions (in which case you just leave the conditions empty, as in the identifierART2 example above). Each ART has an identifier (for instance name_score or identifierART1), which you use as a key in both the artSpecifications and the artConditions dictionaries. The ART identifier key should be assigned a dictionary value. There are three possible keys for the dictionaries listed in artConditions (artEqual, artMin and artMax), and only one key for the artSpecifications dictionaries: artData, which provides the contents of the script or executable that does the scoring, as plist data (displayed as hexadecimal bytes in the ASCII plist format above, and base64-encoded in the XML plist format).
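
    If you are wondering where those hexadecimal bytes come from, here is a minimal sketch that prints the hex representation of a script, ready to be wrapped in < > as an artData value (using one of the attached scripts as an example):

    #! /usr/bin/perl

    # print the bytes of an ART script as hexadecimal, for use as an artData value
    open( my $fh, '<', 'omg_art_test.pl' ) or die "cannot open script: $!";
    my $bytes = do { local $/; <$fh> };
    print unpack( 'H*', $bytes ), "\n";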

    For testing, I tried these job specifications and these scripts on our OpenMacGrid cluster, either without conditions (left column), or restricting the score to the 20000-25000 range (right column). The results have been edited a little bit but this is more or less what I got:

    	uma  score = 30061      
    	bbt  score = 25186      
    	ada  score = 24932      ada  score = 24932
    	Tod  score = 21615      Tod  score = 21615
    	Syn  score = 21369      Syn  score = 21369
    	Sev  score = 21349      Sev  score = 21349
    	PHY  score = 20552      PHY  score = 20552
    	OMG  score = 20301      OMG  score = 20301
    	CIM  score = 17225
    

    Maybe next time I will have examples with some useful scoring scripts, but for now, ASCIIbetical order rules!

    Backward compatibility

    Scoreboard may appear to be a new feature of the Xgrid client, but keep in mind that the actual implementation is at the Controller level. You absolutely need to run the Controller on Mac OS X Leopard (the ART settings will be gracefully ignored by the 10.4 Xgrid Controller). Interestingly, thanks to the plist format used for submission, you do not necessarily need to run the client on Mac OS X Leopard, so you can still submit a Scoreboard-aware job from Tiger. All you need is to add the appropriate keys listed above and use the batch format for submissions. However, the Scoreboard features are only directly available with the Leopard version of the command-line tool xgrid, using the now obvious arguments -art art-path -artid art-identifier -artequal art-value -artmin art-value -artmax art-value. Similarly, the ART keys used in the batch format are only officially available in the Leopard Cocoa APIs (but you can easily make the code Tiger-compatible by providing your own NSString constants with the right values).
    For the Agent, none of that matters. ART scripts are actually no different from normal tasks, from the agent's perspective. Thus, Scoreboard will work with any type of agent, including Panther agents. Note, however, that different versions of Mac OS X do not all have the same features and commands available. In particular, the /usr/sbin/system_profiler command is more limited in Mac OS X 10.3.

    Conclusions

    Scoreboard is a great addition to Xgrid. The implementation is flexible and powerful, and it addresses real-world needs well. It will be easy to add to your current workflow without disrupting your existing setup. In fact, it will probably make a lot of headaches go away if you have a heterogeneous collection of agents.

    Attachments:
    omg_specs_test.plist_.txt (4.24 KB)
    omg_task_test.pl_.txt (573 bytes)
    omg_art_test.pl_.txt (409 bytes)

    Comments


    More details about ART

    Hi,
    How exactly does Xgrid keep track of the scores? Does it cache them somehow or will it send out a new ART to each worker for each new job, even though the ART is the same?
    Also, on Leopard (10.5.1), man xgrid says nothing about ART. Is this the expected behavior, or am I missing something?

    Cheers,
    Yi

    http://yiqiang.org

    ART and Xgrid

    To answer your questions:

    * Xgrid does not cache the ART results; the scripts get sent again for every job. I think the ART is sent only to the available agents, and then only once per agent (as opposed to tasks, of which several can be sent to one agent if it has several CPUs).

    * ART is not documented in the man page (see the other post on Xgrid Leopard, where I complain about the lack of documentation). The arguments for the command line are listed in the synopsis of the man page, but then not explained anywhere in the rest of the text!

    charles

    API update?

    Hi Charles,

    Is there any news on whether the API will be updated to interact with Scoreboard? OpenMPI can use Xgrid, but as far as I can tell there is no mechanism to tell Xgrid to use particular nodes. If the OpenMPI developers could hook into Scoreboard, maybe it could be done.

    Jody

    ART to select nodes

    I am not sure about OpenMPI, but in theory, Scoreboard and ART can be used to tell Xgrid to use specific nodes. The ART script can query the IP address of the agent it is running on, and then compare it to a list of addresses that may be embedded in the script or fetched from a web server (then it's dynamic, and can even have some server-side processing by POST-ing the name of the job). The script then returns a score of 1 (match) or 0 (no run).
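
    A minimal sketch of such a script, with a hard-coded list of made-up addresses, and assuming the agent's primary network interface is en0:

    #! /usr/bin/perl

    # node-selection sketch: print 1 if this agent's IP is in the allowed list, 0 otherwise
    my %allowed = map { $_ => 1 } qw( 10.0.1.12 10.0.1.15 );
    my $ip = `/usr/sbin/ipconfig getifaddr en0`;
    chomp $ip;
    print( $allowed{$ip} ? 1 : 0 );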

    Doing Parallel Computation

    Hi.
    I am starting to work with Xgrid. Previously I worked with PVM, and I'm trying to develop some scripts that can share work between machines on a virtual network.
    The fact is that I need to know when a "task" or "job" is finished, in order to send a new event to the queue. Does somebody know of a good manual or guide for parallel computation with Xgrid?
    Thanks

    Dependent tasks

    Xgrid has some simple mechanisms to define dependencies between jobs or tasks, so that a job or task is only run after a certain job or task has finished. Check the man page for the xgrid command-line tool; it has an example plist where the structure for that is well defined.
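
    From memory, the relevant part of the job specification looks something like the sketch below, where task 1 only starts after task 0 has finished (double-check the exact key names against the man page example):

    taskSpecifications = {
        0 = { command = "/usr/bin/true"; };
        1 = { command = "/usr/bin/true"; dependsOnTasks = ( 0 ); };
    };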

    More info about that format:

  • Xgrid tutorial 3
  • http://www.kellerfarm.com/kfsproducts/yesfree/xgridbatcheditor/index.html

    Just for clarification I

    Just for clarification, I note that Scoreboard does NOT RANK the scores. The scores are a threshold: if an agent passes, it is as good as any other. Also, the artequal and (artmin + artmax) settings are mutually exclusive. Finally, there are a lot of pitfalls in using ART scripts. These are documented in the xgrid wiki at TenGrid.com.

    li

    Learn about cluster computing on macs at http://tengrid.com