The Toowoomba Cluster
The Toowoomba cluster uses Condor to manage a queue of jobs that are run as needed on a ~300 core cluster. Direct access to the cluster (via the condor program) is only possible when a workstation is on the same network. Because most workstations are firewalled from this network, a "postbox" mechanism is provided that allows users to submit jobs and collect output.
The pool can be accessed through the web pages at https://apsrunet.apsim.info/. A username and password are required. Jobs can be uploaded by the Apsim GUI, the same web pages, or old-style ftp access.
Using the APSIM User Interface tool to create jobs
To submit jobs to this system you must:
- In ApsimUI, select the Run on Cluster button
- You have the choice of running all simulations in the currently open .apsim file OR all .apsim files in a directory on your hard disk.
- The "Location of self extracting APSIM" entry is how a specific version of apsim is run. The default url will be the standard release.
- The job is targeted to run on Windows, Linux or both host operating systems.
- The nice user flag should be set for low priority jobs.
- The Single Slot flag specifies whether the job should run on a single CPU slot, or across all available CPUs if unchecked (the default).
- The choice of the number of simulations per job to run has many facets. The general principle is that the longer a job runs, the higher its throughput - as less time is spent setting up and tearing down the job. However, longer running jobs have a higher chance of being interrupted if a higher priority user enters the queue. On the Toowoomba cluster, Linux hosts have either 8 or 16 CPUs per machine, the Windows hosts have 12. Some experimentation may be required to achieve a reasonable runtime - typically 20 minutes per job.
- This process either writes a single zipfile for uploading, or uploads the file directly to the cluster. The file is named after the time it was created.
- At some later time, the output of this simulation can be downloaded from the cluster. It will have the same name, but a ".out.zip" extension.
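As a rough way to pick a starting value for the number of simulations per job, the sketch below scales the 20 minute target by the number of CPUs on the host. It assumes the simulations run in parallel across all cores and each take about the same time. The function name sims_per_job and the 90 second runtime are illustrative assumptions, not part of ApsimUI.

```shell
#!/bin/sh
# Estimate how many simulations to pack into one job.
# usage: sims_per_job CORES SECONDS_PER_SIMULATION [TARGET_MINUTES]
# Assumption: simulations run in parallel on all cores and take similar time.
sims_per_job() {
    cores=$1
    secs_per_sim=$2
    target_min=${3:-20}          # typical target runtime quoted above
    # Simulations one core can finish within the target (at least 1)
    per_core=$(( target_min * 60 / secs_per_sim ))
    [ "$per_core" -lt 1 ] && per_core=1
    echo $(( cores * per_core ))
}

# Example: 90 second simulations on the 8, 12 and 16 CPU hosts
for cores in 8 12 16 ; do
    echo "$cores CPUs: $(sims_per_job "$cores" 90) simulations per job"
done
```

Treat the result as a first guess only. Measured runtimes on the cluster, and how often jobs get interrupted, should decide the final value.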
The Apsim JV condor cluster has been upgraded, and several things have changed:
- There are no more 32 bit hosts.
- Instead of running "one simulation per CPU", the system attempts to run several simulations in parallel, using all available CPUs on a single machine, significantly reducing the amount of network traffic for each job. This changes the interpretation of "number of simulations per job" - as the throughput is now related to the number of CPU cores as well as their speed.
- Extra tags have been added to the Toowoomba condor machines to facilitate matching jobs with whole machines.
- There is a PBS job submission script in the zipfile, which will require a small hand edit for your installation.
- It is possible to run Apsim on clusters without native mono .NET installations.
Users still need to register (with Peter or Justin) to obtain a username/password.
Files can be transferred to the cluster by:
- ftp or sftp
- web, via https://apsrunet.apsim.info/uploadForm.php
- The ApsimUI, automagically
No Apsim installations are present on the new machines.
Some background info on condor can be read here
Use the ApsimUI to submit a job. Enter your username/password. Later on, open ftp://<username>:<password>@apsrunet.apsim.info: and look for an .out.zip file.
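For users who prefer the command line, the ftp route can be scripted with curl. This is a minimal sketch under several assumptions: the helper names are hypothetical, job archives are assumed to sit in the ftp login directory, and the output name is assumed to be formed by replacing the trailing .zip with .out.zip.

```shell
#!/bin/sh
# Hypothetical helpers; CLUSTER_FTP and the function names are illustrative.
CLUSTER_FTP=${CLUSTER_FTP:-ftp://apsrunet.apsim.info}

# Output archive name for a job archive (assumed rule: foo.zip -> foo.out.zip)
output_name() {
    echo "${1%.zip}.out.zip"
}

# Upload a job archive.  usage: upload_job USER PASSWORD FILE
upload_job() {
    # A URL ending in "/" makes curl reuse the local file name on the server
    curl --fail --user "$1:$2" --upload-file "$3" "$CLUSTER_FTP/"
}

# Download the finished output.  usage: fetch_output USER PASSWORD JOBFILE
fetch_output() {
    out=$(output_name "$(basename "$3")")
    curl --fail --user "$1:$2" --output "$out" "$CLUSTER_FTP/$out"
}

# Typical use (not run here):
#   upload_job myname mypassword 2013-01-01.zip
#   fetch_output myname mypassword 2013-01-01.zip
```

Passwords given on the command line are visible to other users of the workstation. curl's --netrc option, which reads credentials from ~/.netrc, avoids that.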
Using Amazon's EC2 cloud to run APSIM jobs
Amazon's EC2 service can be configured to run a condor cluster - though its pricing policy favours Linux hosts ahead of Windows. Recent (post 7.1) versions of Apsim run on 32 bit unix platforms without trouble; r1535 is recommended for use with Condor.
Given the high network latency between the cloud and our workstations, it's preferable to run the entire cluster inside the cloud: a single master and multiple workers. To configure APSIM on the workers, take a vanilla AMI and install the following packages:
apt-get -y install libboost1.40 libboost-thread1.40.0 libboost-date-time1.40.0 libboost-filesystem1.40.0 libboost-regex1.40.0 libxml2 libc6-amd64 mono-runtime libgfortran3 tcl8.4 tcllib tdom
The Apsim binaries can be unpacked under /opt. Condor binaries are available from the condor website.
When an EC2 host is started, it is behind a NAT firewall and has a dynamic IP address; it knows very little about the world around it. The condor daemons need to be told which pool they belong to and where to find its master. To dynamically configure condor, run the following shell script at boot time to point the condor master & workers at the host nominated in the AMI user data field. Take a snapshot of the AMI at this point - it's ready to be started.
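The boot script refers to $userdata, $private_name, $private_addr and $public_addr without showing where they come from. One plausible source is the EC2 instance metadata service, sketched below. The function names are hypothetical, and instances that enforce IMDSv2 would also need a session token, which this sketch does not handle.

```shell
#!/bin/sh
# Link-local metadata endpoint, reachable only from inside an instance
METADATA_URL=${METADATA_URL:-http://169.254.169.254/latest}

# Fetch one metadata path; prints nothing if the request fails
ec2_metadata() {
    curl -s -f --max-time 5 "$METADATA_URL/$1" 2>/dev/null || true
}

# Populate the variables used by the condor set-up script
load_instance_info() {
    userdata=$(ec2_metadata user-data)
    private_name=$(ec2_metadata meta-data/local-hostname)
    private_addr=$(ec2_metadata meta-data/local-ipv4)
    public_addr=$(ec2_metadata meta-data/public-ipv4)
}

# Typical use at boot (not run here):
#   load_instance_info
#   echo "user data: $userdata, private name: $private_name"
```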
# Set up condor local config
# The user data field selects the role of this host:
# "master" : I am a condor master running in the cloud
# "other hostname" : I am a condor worker, talking to a master called "other hostname"
# Use shared ports if we're a worker talking to a master outside the cloud
# Assumes $userdata, $private_name, $private_addr and $public_addr are already set
if [ "$userdata" = "master" ] ; then
    echo Setting up cloud only master at $private_name
    cat > /etc/condor/condor_config.local <<EOF
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
EOF
elif [ "$userdata" = "apsrunet.apsim.info" ] ; then
    echo Setting up worker, master = $userdata
    cat > /etc/condor/condor_config.local <<EOF
DAEMON_LIST = MASTER, STARTD, SHARED_PORT
TCP_FORWARDING_HOST = $public_addr
PRIVATE_NETWORK_INTERFACE = $private_addr
EOF
elif [ "$userdata" != "" ] ; then
    echo Setting up cloud only worker, master = $userdata
    cat > /etc/condor/condor_config.local <<EOF
CONDOR_HOST = $userdata
EOF
else
    echo "Can't work out what class of condor pool this is. Giving up."
    exit 1
fi
cat >> /etc/condor/condor_config.local <<EOFM4
COLLECTOR_NAME = Apsim
ALLOW_WRITE = *
ALLOW_READ = *
COUNT_HYPERTHREAD_CPUS = False
# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = \$(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = \$(FULL_HOSTNAME)
USE_NFS = False
USE_AFS = False
UPDATE_COLLECTOR_WITH_TCP = True
# Allow local host and the central manager to manage the node
HOSTALLOW_ADMINISTRATOR = \$(FULL_HOSTNAME), \$(COLLECTOR_HOST)
# Use random numbers here so the workers don't all hit the collector at
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = \$RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = \$RANDOM_INTEGER(230, 370)
JAVA_CLASSPATH_DEFAULT = \$(LIBEXEC) \$(LIBEXEC)/lib \$(LIBEXEC)/lib/scimark2lib.jar
ALLOW_DAEMON = *
SEC_PASSWORD_FILE = \$(LOCK)/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI
EOFM4
chown condor:condor /var/run/condor
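Once the configuration has been written, it can be worth confirming that condor actually sees the intended values. The sketch below wraps condor_config_val, which prints the value of a configuration macro, in a hypothetical check_setting helper. The settings checked in the usage comment are taken from the configuration above.

```shell
#!/bin/sh
# Report whether a condor setting has the expected value.
# usage: check_setting NAME EXPECTED
check_setting() {
    actual=$(condor_config_val "$1" 2>/dev/null)
    if [ "$actual" = "$2" ] ; then
        echo "ok: $1 = $actual"
    else
        echo "MISMATCH: $1 is '$actual', expected '$2'"
        return 1
    fi
}

# Typical use on a freshly booted host (not run here):
#   check_setting COLLECTOR_NAME Apsim
#   check_setting UPDATE_COLLECTOR_WITH_TCP True
```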
The following shell script starts one pool master and multiple workers using the EC2 toolset. Alternatively, the instances can be managed from the AWS control panel, provided that the "user data" field for workers is set to the internal DNS name of the pool master. In addition, it's important to open enough ports (9618 and 40000-41000) for the condor daemons: each "job" requires 5 open ports, so the master needs 5 open ports for every worker machine. The ports are opened through a security group, which is chosen when the instance is created.
## relies on EC2_PRIVATE_KEY and EC2_CERT; IMAGE, ZONE, KEYPAIR, EC2_USER and num must also be set
# Start the pool master; its user data is the literal string "master"
ec2-run-instances $IMAGE --region $ZONE -n 1 -d master -g Worker -t c1.medium -k $KEYPAIR > manager-instance.txt
MANAGER=`grep ^INSTANCE manager-instance.txt | cut -f 2 - `
echo Started manager at $MANAGER
# Open the condor ports between members of the Worker security group
ec2-authorize Worker --region $ZONE -P tcp -p 9618 -u $EC2_USER -o Worker
ec2-authorize Worker --region $ZONE -P tcp -p 40000-40050 -u $EC2_USER -o Worker
ec2-authorize Worker --region $ZONE -P udp -p 40000-40050 -u $EC2_USER -o Worker
# Poll until the manager has been assigned an internal address, then start the workers
while true ; do
    # Internal IP of manager
    ec2-describe-instances $MANAGER --region $ZONE > manager-instancedata.txt
    MANAGER_INTERNAL=`grep ^INSTANCE manager-instancedata.txt | cut -f 5`
    if [ "$MANAGER_INTERNAL" != "" ] ; then
        MANAGER_NAME=`grep ^INSTANCE manager-instancedata.txt | cut -f 4`
        echo starting workers under manager $MANAGER_NAME
        ec2-run-instances $IMAGE --region $ZONE -n $num -d $MANAGER_INTERNAL -g Worker -t c1.medium -k $KEYPAIR > worker-instance.txt
        echo Started workers `grep ^INSTANCE worker-instance.txt | cut -f 2 - `
        break
    fi
    # Manager not ready yet; wait before asking again
    sleep 10
done
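The port sizing rule given earlier (five open ports per worker machine) can be turned into a quick calculation when choosing the range to authorize in the security group. This is a sketch that takes the rule at face value. The port_range helper is hypothetical, and matching the result to condor's LOWPORT and HIGHPORT settings is an assumption, since the configuration above does not set them.

```shell
#!/bin/sh
# Print the inclusive port range the master needs for a given pool size.
# usage: port_range LOWPORT WORKERS [PORTS_PER_WORKER]
port_range() {
    low=$1
    workers=$2
    per_worker=${3:-5}           # five ports per worker, as stated above
    echo "$low-$(( low + workers * per_worker - 1 ))"
}

port_range 40000 10    # prints 40000-40049
```

By this rule a 10 worker pool fits inside the 40000-40050 range opened by the script, while larger pools need a wider range.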
Note that UDP will fail inside the cloud, so specify "UPDATE_COLLECTOR_WITH_TCP = True" in your condor config.
To save you time, Peter deVoil (email@example.com) can share AMIs with Apsim and Condor installed and configured as above.