Running APSIM on compute clusters
To run APSIM remotely, the simulations, and possibly apsim itself, have to be packaged, uploaded, scheduled, executed and downloaded from remote machines. Each compute cluster has a different way of accomplishing each of these steps. The APSIM community has developed tools that, while specific to each cluster, may be generalised to suit other purposes.
This page discusses a) methods of running apsim on remote hardware, and b) applications that package simulations to run on remote hardware.
Remote execution requires building a version of apsim that can run on the machines of the cluster. If the cluster hardware consists of Windows machines, the standard apsim release (or a self-extracting executable from bob) is adequate. Unix clusters require either a native build of apsim (discussed in generic terms and specific to USQ), or a container with a build of apsim inside. Software containers such as docker avoid the need to compile and deploy apsim for each piece of hardware in the cluster, making management of the cluster much simpler in a large multi-user environment. Scripts to build apsim in a container are in the Release directory of the apsim source code.
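For example, a containerised build can run simulations mounted from the local filesystem. The sketch below is illustrative only; the image name and the path to the apsim executable inside the container are assumptions:
# Run one simulation with a containerised apsim build.
# The image name (apsim/classic) and entry point are assumptions.
docker run --rm -v $(pwd):/data apsim/classic \
    mono /opt/apsim/Model/Apsim.exe /data/MySimulation.apsim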
Packaging an apsim simulation involves rewriting the components that use external input files (e.g. met and ini components) to avoid absolute path names that would not exist on the remote machine. The simulations are usually grouped into "jobs" that balance the time taken to run the simulation against the time taken to transfer the inputs.
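For example, a met component referencing an absolute path on the submitting machine will fail on the remote machine; the packaged copy should reference the file by name only, with the met file shipped alongside the simulation. The element below is illustrative of apsim classic's XML:
<!-- Before packaging: an absolute path that exists only on the submit machine -->
<filename name="filename" input="yes">C:\work\met\Dalby.met</filename>
<!-- After packaging: a bare filename, with Dalby.met included in the job's zipfile -->
<filename name="filename" input="yes">Dalby.met</filename>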
A scheduler is responsible for accepting a (packaged) job and arranging for a remote machine to execute it. Three schedulers are in common use within the APSIM community:
- Microsoft Azure, a cloud computing service deployed over a global network of data centres.
- Condor is an advanced high throughput computing (HTC) scheduler that can run apsim simulations. Some background information on condor can be read here. It can run on Unix or Windows hosts, or both. It provides methods for transferring files between submit machines and executing nodes, and for specifying job priority.
- PBS clusters are used by many Australian universities and larger research organisations. PBS is a Unix-based system that relies on NFS for transferring both the apsim input data and the singularity container. A sketch of a typical PBS job script appears below.
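The script below runs one packaged simulation inside a singularity container on a PBSPro cluster. The resource requests, container path and apsim entry point are assumptions to be adapted to the local cluster:
#!/bin/bash
#PBS -N apsim-job
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=02:00:00
cd $PBS_O_WORKDIR
# The container location and the apsim entry point are assumptions.
singularity exec /shared/containers/apsim.sif \
    mono /opt/apsim/Model/Apsim.exe MySimulation.apsim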
Using the APSIM User Interface tool to create condor & PBS jobs
To submit jobs to these systems:
- In ApsimUI, select the Generate Simulations button
- You have the choice of running all simulations in the currently open .apsim file OR all .apsim files in a directory on your hard disk.
- The "Location of self extracting APSIM" entry is how a specific version of apsim is run. The default (a url) will be the current release.
- The job can be targeted to run on Windows, Linux or both host operating systems.
- The nice user flag should be set for low priority jobs.
- The Single Slot flag specifies whether the job should run on a single CPU slot; if unchecked (the default), the job runs across all available CPUs.
- The choice of the number of simulations per job involves a trade-off. The general principle is that the longer a job runs, the higher its throughput, as less time is spent setting up and tearing down the job. However, longer running jobs have a higher chance of being interrupted if a higher priority user enters the queue.
- The output of this process is a single zipfile for uploading, named with the time of day at which it was created. An illustrative condor submit description for such a job is sketched below.
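For reference, a condor job of this kind boils down to a submit description along the lines of the sketch below; the executable and input file names are hypothetical:
# A hedged sketch of a condor submit description; file names are hypothetical.
universe                = vanilla
executable              = RunApsim.bat
transfer_input_files    = job1.zip, Apsim.zip
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
nice_user               = True
queue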
Using APSIM on Azure
AgResearch has developed a tool to run APSIM Classic on the Microsoft Azure cloud.
The Make, Run and Summarise Multiple APSIM Simulations (MARS) user guide can be found here.
Using APSIM on UQ HPC (awoonga, tinaroo)
There is a standalone tool available to run simulations on the UQ clusters; both use a PBSPro scheduler running on dedicated and virtual hardware. The tool relies on singularity containers updated by Peter deVoil. Users need to contact RCC for a username / password. It uploads data to the user's home directory, submits the job, and downloads the outputs when finished.
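The steps the tool automates are roughly equivalent to the shell session below; the hostname, paths and job script name are assumptions:
# Hostname, paths and job script name are assumptions.
scp simulations.zip user@hpc.example.edu.au:jobs/
ssh user@hpc.example.edu.au 'cd jobs && unzip -o simulations.zip && qsub run_apsim.pbs'
# ... once PBS reports the job has finished ...
scp 'user@hpc.example.edu.au:jobs/outputs/*' ./outputs/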
Using Amazon's EC2 cloud to run APSIM jobs
Amazon's EC2 service can be configured to run a condor cluster, though its pricing policy favours Linux hosts ahead of Windows.
Given the high network latency between the cloud and our workstations, it's preferable to run the entire cluster inside the cloud: a single master and multiple workers. To configure APSIM on the workers, take a vanilla AMI and install the following packages:
apt-get -y install libxml2 libc6-amd64 mono-runtime libgfortran3 tcl8.5 tcllib tdom
The Apsim binaries can be unpacked under /opt. Condor binaries are available from the condor website.
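For example (the archive name is an assumption; use the apsim build that matches the AMI's toolchain):
# Unpack an apsim Linux build under /opt; the archive name is an assumption.
sudo mkdir -p /opt/apsim
sudo tar -xzf apsim-linux-binaries.tar.gz -C /opt/apsim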
When an EC2 host is started, it is behind a NAT firewall and has a dynamic IP address; it knows very little about the world around it. The condor daemons need to be told which pool they belong to and where to find the master. To dynamically configure condor, run the following shell script at boot time to point the condor master & workers at the host nominated in the AMI user data field. Take a snapshot of the AMI at this point - it's ready to be started.
# Set up condor local config
# "master" : I am a condor master running in the cloud
# "other hostname" : I am a condor worker, talking to a master called "other hostname"
# Use shared ports if we're a worker talking to a master outside the cloud
if [ "$userdata" = "master" ] ; then
   echo Setting up cloud only master at $private_name
   cat > /etc/condor/condor_config.local <<EOFM4
CONDOR_HOST = $private_name
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
EOFM4
elif [ "$userdata" = "apsrunet.apsim.info" ] ; then
   echo Setting up worker, master = $userdata
   cat > /etc/condor/condor_config.local <<EOFM4
CONDOR_HOST = $userdata
DAEMON_LIST = MASTER, STARTD, SHARED_PORT
USE_SHARED_PORT = True
TCP_FORWARDING_HOST = $public_addr
PRIVATE_NETWORK_INTERFACE = $private_addr
EOFM4
elif [ "$userdata" != "" ] ; then
   echo Setting up cloud only worker, master = $userdata
   cat > /etc/condor/condor_config.local <<EOFM4
CONDOR_HOST = $userdata
DAEMON_LIST = MASTER, STARTD
EOFM4
else
   echo Cant work out what class of condor pool this is. Giving up.
   exit 1
fi
cat >> /etc/condor/condor_config.local <<EOFM4
COLLECTOR_NAME = Apsim
ALLOW_WRITE = *
ALLOW_READ = *
COUNT_HYPERTHREAD_CPUS = False
# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = \$(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = \$(FULL_HOSTNAME)
USE_NFS = False
USE_AFS = False
UPDATE_COLLECTOR_WITH_TCP = True
# Allow local host and the central manager to manage the node
HOSTALLOW_ADMINISTRATOR = \$(FULL_HOSTNAME), \$(COLLECTOR_HOST)
# Use random numbers here so the workers don't all hit the collector at
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = \$RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = \$RANDOM_INTEGER(230, 370)
JAVA_CLASSPATH_DEFAULT = \$(LIBEXEC) \$(LIBEXEC)/lib \$(LIBEXEC)/lib/scimark2lib.jar
ALLOW_DAEMON = *
SEC_PASSWORD_FILE = \$(LOCK)/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI
EOFM4
chown condor.condor /var/run/condor
The following shell script starts one pool master and multiple workers using the EC2 toolset. Alternatively, the instances can be managed from the AWS control panel, provided that the "user data" field for workers is set to the internal DNS name of the pool master. In addition, it's important to open ports 9618 and 40000-41000 for the condor daemons: each "job" requires 5 open ports, so the master must have 5 times as many ports open as there are worker machines. This is done by creating a security group that is chosen when the instance is created.
## Relies on EC2_PRIVATE_KEY and EC2_CERT being set in the environment
ec2-run-instances $IMAGE --region $ZONE -n 1 -d master -g Worker -t c1.medium -k $KEYPAIR > manager-instance.txt
MANAGER=`grep ^INSTANCE manager-instance.txt | cut -f 2 -`
echo Started manager at $MANAGER
ec2-authorize Worker --region $ZONE -P tcp -p 9618 -u $EC2_USER -o Worker
ec2-authorize Worker --region $ZONE -P tcp -p 40000-40050 -u $EC2_USER -o Worker
ec2-authorize Worker --region $ZONE -P udp -p 40000-40050 -u $EC2_USER -o Worker
# Wait until the manager has an internal IP, then start the workers under it
while true ; do
   ec2-describe-instances $MANAGER --region $ZONE > manager-instancedata.txt
   MANAGER_INTERNAL=`grep ^INSTANCE manager-instancedata.txt | cut -f 5`
   if [ "$MANAGER_INTERNAL" != "" ] ; then
      MANAGER_NAME=`grep ^INSTANCE manager-instancedata.txt | cut -f 4`
      echo starting workers under manager $MANAGER_NAME
      ec2-run-instances $IMAGE --region $ZONE -n $num -d $MANAGER_INTERNAL -g Worker -t c1.medium -k $KEYPAIR > worker-instance.txt
      echo Started workers `grep ^INSTANCE worker-instance.txt | cut -f 2 -`
      break
   fi
   sleep 5
done
Note that UDP will fail inside the cloud, so specify "UPDATE_COLLECTOR_WITH_TCP = True" in your condor config.