The Toowoomba Cluster (ODIN)
The Toowoomba cluster uses Condor to manage a queue of jobs that are run as needed on a 48-core (12 x 4-core) cluster. Direct access to the cluster (via the condor program) is only possible from a workstation on the same network. As most workstations are some distance from it, a "postbox" mechanism is provided that allows users to submit jobs and collect output.
The postbox monitor on the APSRU cluster looks for two forms of job:
- old style ".sub" jobs, and
- new style zipfiles (created by the Apsim UI).
The current state of the pool can be seen via the web pages at http://odin/condor/.
Toowoomba jobs via Disk Shares
APSIM Initiative members connected to DDCNet can drop old style ".sub" jobs directly into the \\odin\CondorPostbox shared folder. Note that condor will submit the job as soon as it discovers the .sub file, so copying jobs should be done in two passes: the data first, and the .sub files last.
Another disk share, \\odin\CondorDropBox, is monitored for new style ".zip" jobs created by the Apsim UI.
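The two-pass copy described above for ".sub" jobs can be sketched as a small shell function. This is only an illustration: the function name two_pass_copy is made up for this example, it handles a flat directory only, and it assumes the postbox share is reachable as an ordinary path from your shell (for example through Cygwin or a mounted share).

```shell
#!/bin/sh
# Copy a job directory to the postbox in two passes, so that condor never
# sees a .sub file before the data it refers to has arrived.
# Usage: two_pass_copy SOURCE_DIR DEST_DIR   (both paths are placeholders)
# Only regular files directly inside SOURCE_DIR are copied.
two_pass_copy() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1

    # Pass 1: everything except the .sub files.
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.sub) ;;                            # held back for pass 2
            *) cp "$f" "$dest"/ || return 1 ;;
        esac
    done

    # Pass 2: the .sub files, which trigger submission.
    for f in "$src"/*.sub; do
        [ -f "$f" ] || continue
        cp "$f" "$dest"/ || return 1
    done
}
```

A call might look like two_pass_copy ./myjob //odin/CondorPostbox/myjob, where both paths are placeholders for your own job directory and the share as your shell sees it.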
Toowoomba jobs via a Dropbox share
Users without access to DDCNet must put files on a Dropbox share. Dropbox allows transparent sharing of files across the internet, so we can use it to transfer files between our local workstations and the APSRU cluster. To enable the service, share a folder from your workstation with the cluster by following these steps:
- Create a directory in your Dropbox on your workstation. Use a name that you & others can recognise as yours.
- Right click on it, select the "Dropbox" menu and "Share this Folder".
- Share it with "peter.devoil@deedi.qld.gov.au" (whose name shows up as "Apsim JV")
- Open a remote desktop to odin, find the Dropbox icon in the system tray and "Launch dropbox website". This will be logged into the Dropbox system as "peter.devoil@deedi.qld.gov.au".
- There will (maybe after a while) be a red notification in the sharing menu on this website: your invitation from step 3. "Accept" it, so that files you drop into the Dropbox folder on your workstation are shared with the Dropbox on odin.
- Should you be curious, this same area can be seen as a disk share somewhere under "\\odin\ExternalDropbox\My Dropbox".
Using the APSIM User Interface tool to create jobs
To submit jobs to this system you must:
- In ApsimUI, select the Run on Cluster button
- You have the choice of running all simulations in the currently open .apsim file OR all .apsim files in a directory on your hard disk.
- The cluster has a stock Apsim 7.3 installation in Apsim73-r1387: use this as the "APSIM version".
- Enter the destination directory of the share where you're going to put the files.
If you're using Dropbox, browse to the Dropbox folder you created and shared earlier - something like "C:\Documents and Settings\<username>\My Documents\My Dropbox\<unique share name>"
If you're using a disk share, use something like "\\odin\CondorDropbox\<username>"
- If you're using Dropbox you should tick the Zip all files check box so that a .zip file is created.
- If you have a directory of .apsim files, each with a single simulation in it, then it is quicker to run them as .apsim files and not have them converted to .sim files, so untick the Convert to .sim files box.
- Wait for the job outputs to appear in the destination directory. Dropbox will make annoying popping noises when the transfer is complete.
The output of the job will appear in the same place as the input, zipped up with an ".out.zip" extension.
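If you would rather not watch the destination directory by hand, a polling loop along these lines can do the waiting. It is a sketch only: the function name wait_for_output and its default of 60 checks at 30 second intervals are invented for this example, and it simply reports the first file it finds whose name ends in ".out.zip".

```shell
#!/bin/sh
# Poll DIR until a file ending in .out.zip appears, checking up to
# MAX_TRIES times with INTERVAL seconds between checks.
# Usage: wait_for_output DIR [MAX_TRIES] [INTERVAL]
# Prints the path of the first match and returns 0, or returns 1 on timeout.
wait_for_output() {
    dir=$1
    max_tries=${2:-60}
    interval=${3:-30}
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        for f in "$dir"/*.out.zip; do
            if [ -f "$f" ]; then
                echo "$f"
                return 0
            fi
        done
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

Point it at the same destination directory you entered in ApsimUI. With Dropbox, a file may appear before it has finished syncing, so give it a moment before unzipping.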
Using Amazon's EC2 cloud to run APSIM jobs
Amazon's EC2 service can be configured to run a condor cluster, though its pricing policy favours Linux hosts ahead of Windows. Recent (post 7.1) versions of Apsim run on 32 bit unix platforms without trouble; r1535 is recommended for use with Condor.
Given the high network latency between the cloud and our workstations, it's preferable to run the entire cluster inside the cloud: a single master and multiple workers. To configure APSIM on the workers, take a vanilla AMI and install the following packages:
apt-get update
apt-get -y install libboost1.40 libboost-thread1.40.0 libboost-date-time1.40.0 libboost-filesystem1.40.0 libboost-regex1.40.0 libxml2 libc6-amd64 mono-runtime libgfortran3 tcl8.4 tcllib tdom
The Apsim binaries can be unpacked under /opt. Condor binaries are available from the condor website.
When an EC2 host is started, it is behind a NAT firewall and has a dynamic IP address; it knows very little about the world around it. The condor daemons need to be told which pool they belong to and where to find its master. To dynamically configure condor, run this shell script at boot time to point the condor master & workers at the host nominated in the AMI user data field. Take a snapshot of the AMI at this point - it's ready to be started.
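For readers without access to the script referred to above, the idea can be sketched as follows. This is not that script: the function names and the config file path in the usage note are assumptions made for illustration. It relies on the standard EC2 instance metadata address (169.254.169.254) to read the user data, and on the condor settings CONDOR_HOST and UPDATE_COLLECTOR_WITH_TCP. Newer instance types may require a session token before the metadata service will answer, which this sketch does not handle.

```shell
#!/bin/sh
# Print local condor settings that point this host at the given pool master.
condor_local_config() {
    master=$1
    cat <<EOF
CONDOR_HOST = $master
UPDATE_COLLECTOR_WITH_TCP = True
EOF
}

# Read the instance user data, assumed to hold only the pool master's
# internal DNS name.
fetch_user_data() {
    curl -s --fail http://169.254.169.254/latest/user-data
}

# Write the settings to CONFIG_FILE (the caller chooses the path).
# Returns non-zero if the user data is missing or empty.
configure_condor() {
    config_file=$1
    master=$(fetch_user_data) || return 1
    [ -n "$master" ] || return 1
    condor_local_config "$master" > "$config_file"
}
```

At boot, something like configure_condor /etc/condor/condor_config.local would be called before the condor daemons start; that path is a guess and depends on how condor was installed on the AMI.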
This shell script starts one pool master and multiple workers using the EC2 toolset. Alternatively, the instances can be managed from the AWS control panel, provided that the "user data" field for workers is set to the internal DNS of the pool master. In addition, it's important to open many ports (9618 and 40000-41000) for the condor daemons - each "job" requires 5 open ports, so the master must have at least 5 open ports for every worker machine. This is done by creating a security group, which is chosen when the instance is created.
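The port arithmetic above is easy to check before choosing a range. The helper names below are hypothetical; the figure of 5 ports per worker comes from the text, and LOWPORT and HIGHPORT are the condor settings that restrict the daemons to a port range.

```shell
#!/bin/sh
# Ports the master needs for WORKERS worker machines, at 5 ports each.
ports_needed() {
    echo $(( $1 * 5 ))
}

# Succeed if the inclusive range LOW..HIGH has enough ports for WORKERS.
# Usage: range_is_enough WORKERS LOW HIGH
range_is_enough() {
    workers=$1
    low=$2
    high=$3
    [ $(( high - low + 1 )) -ge "$(ports_needed "$workers")" ]
}

# Print condor settings that keep the daemons inside LOW..HIGH.
condor_port_config() {
    printf 'LOWPORT = %s\nHIGHPORT = %s\n' "$1" "$2"
}
```

By this count the 40000-41000 range holds 1001 ports, enough for up to 200 workers. The same range must be opened in the security group, along with 9618.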
Note that UDP will fail inside the cloud, so specify "UPDATE_COLLECTOR_WITH_TCP = True" in your condor config.
To save you time, Peter deVoil (peter.devoil@deedi.qld.gov.au) can share AMIs with Apsim and Condor installed and configured as above.