By William Streck…

CRS Job Description Files

CRS jobs are defined by a job description file (JDF), which is submitted to the CRS submit daemon.  It is formatted as a standard .ini-style config file that looks like:

# A comment starts with a pound-sign

[section-name]
variable = value
something = else

There are a number of required sections and variables, and some optional ones.  You first define a section called [main], where you declare how many inputs and outputs the job has and may set a number of optional variables.  Then you define one or more executable sections, corresponding to the programs you wish to run, and optionally which inputs/outputs belong to each.  Finally, you define the input and output file names and their types.

Required Sections and Arguments

  • [main]: Must exist and contain the following
    • num_inputs / num_outputs = <int>
  • [exec-0]:
    • exec = <full path of executable>
  • [input-0], [output-0]
    • type = <file type, HPSS, DCACHE, etc...>
    • path = <path up to the folder containing file, e.g. "/tmp/folder">
    • name = <filename, in folder specified in path, e.g. "file_1.txt">

The input, output, and exec sections can have multiple instances, formatted like [input-n], where n is an integer starting at 0 that numbers the section.
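
For illustration, a minimal JDF containing only the required sections for a job with one executable, one input, and one output might look like the following.  The paths and filenames are hypothetical; the filename key is written here as name =, per the list above (note that the full sample at the bottom of this page uses file = for the same field):

[main]
num_inputs = 1
num_outputs = 1

[exec-0]
exec = /path/to/my_executable

[input-0]
type = HPSS
path = /path/to/input/folder/
name = input_file.dat

[output-0]
type = HPSS
path = /path/to/output/folder/
name = output_file.dat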

Optional Arguments

Main

  • queue = <a defined queue; see the queue documentation for details>
  • max_tries = <int> Number of times the job will retry on a stage-miss before giving up, defaults to 2
  • on_error / on_complete = <command> Command that runs in a shell when the job enters the DONE or ERROR state.  This can happen on either the submit or the execute node, so the executable should be available via NFS/AFS from anywhere the job may run.  The job will block until the command finishes, so please keep these commands short and error-free.
  • name = <string> a name for the job that must be unique across all jobs currently in CRS
  • auto_remove = <bool> (true/false, yes/no, or 1/0), determines if the job will be automatically deleted from the CRS database on successful completion

Exec-N

  • args = <string> The arguments to pass to the executable on the command line; they are not passed through a shell, so no escaping is needed.
  • env = <string> A string formatted as "<KEY>=<VALUE>;<KEY2>=<VALUE2>", i.e. a semicolon-separated list of key/value pairs, each joined by an equals sign.  These environment variables are passed to the executable when it runs, in addition to those from the Condor job and those provided by CRS.
  • inputs / outputs = <int[,int,...]> Necessary only if there are multiple [exec-n] sections, an integer or comma-separated list of integers corresponding to which input / output files are used by this executable.
  • stdout / stderr = <string> The full path where the job's standard output/error stream will be written when it completes; must be locally accessible (NFS, etc.)
  • gzip_output = <bool> (true/false, 1/0, etc...) (default false) Compress the job's stdout/stderr streams as they are written to their final destination.

See the example job file below for details.

Running Jobs

CRS jobs can be created by running "crs_job -create <file(s)>" or "crs_create <file(s)>", which take one or more JDFs as arguments, insert the jobs into the database, and return the unique ID of each job.  In order for a job to be considered for running, you must submit it via "crs_job -submit <jobid(s)>" or "crs_submit <jobid(s)>".  This allows you to create many jobs ahead of time and submit them in batches.  The command "crs_insert <file(s)>" is simply a shortcut that performs the previous two steps in one, taking a job file, creating the job, and submitting it to run.
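
For example, the two workflows look roughly like this (the job file names and returned job IDs are hypothetical):

# Two-step: create jobs now, submit them later in a batch
crs_create job1.jdf job2.jdf      # prints the new job IDs, e.g. 1201 and 1202
crs_submit 1201 1202              # jobs are now eligible to run

# One-step shortcut: create and submit in a single command
crs_insert job3.jdf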

Environment Variables

CRS jobs can have environment variables from three sources.  First, when the job is submitted to Condor, it reads from a template that can specify environment variables, although you will not need to touch this unless there are variables common to all your jobs.  Next, each executable can specify its own environment variables.  Lastly, CRS provides some environment variables of its own to each executable.  These are listed below.

Environment Variable    Value
INPUT<N>                Filename of the Nth input for that executable, no directory
OUTPUT<N>               Filename of the Nth output for that executable, no directory
ACTUAL_INPUT<N>         Full path of the Nth input, as specified in the JDF
ACTUAL_OUTPUT<N>        Full path of the Nth output, as specified in the JDF
STD_OUT                 Local filename of the exec's standard output
STD_ERR                 Local filename of the exec's standard error
CRS_STDOUT_DIR          Full path to the directory where the exec's standard output will go
CRS_STDERR_DIR          Full path to the directory where the exec's standard error will go
CRS_JOB_NAME            The job name, if provided in the JDF; otherwise meaningless
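
As an illustration, a wrapper executable listed in an [exec-n] section could use these variables roughly as follows.  This is a hypothetical script: my_reco_program is a stand-in for a real program, and N is assumed to be 0-based, matching the [input-n]/[output-n] section numbering.

#!/bin/sh
# Hypothetical wrapper run as an [exec-n] entry; CRS sets the variables below.
echo "Processing $INPUT0 (staged from $ACTUAL_INPUT0)"

# Produce the first declared output in the job's scratch directory.
my_reco_program "$INPUT0" -o "$OUTPUT0"

echo "$OUTPUT0 will be exported to $ACTUAL_OUTPUT0"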

Exit Codes, Error Codes, and Callbacks

When the CRS job is running on a farm node and enters the ERROR or DONE state, it will execute a callback if it is configured to do so.  The callback is simply run in a shell underneath the main crs-exec process.  This means it should be robust and short, i.e. don't do much and don't block for too long while doing it, because the job will hang until the command is done.
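
As an illustration of a suitably short callback, the following hypothetical on_error helper (the path and log location are made up) just appends one line to a log and returns immediately:

#!/bin/sh
# Hypothetical on_error helper, kept on AFS/NFS so it is reachable from both
# the submit and execute nodes.  It does almost nothing and exits right away,
# so the job is not held up waiting for it.
echo "CRS job failed on $(hostname) at $(date)" >> /afs/some/shared/path/crs_errors.log
exit 0

Saved, for example, as /afs/some/shared/path/log_error.sh, it would be referenced in the JDF as on_error = /afs/some/shared/path/log_error.sh.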

The job will set the following error codes in the corresponding scenarios.

Error Code   Reason
10           The job is gone from the Condor queue
20           The job encountered an error pre-staging one or more of its input files
30           The job retried too many times and is giving up
40           An error occurred importing the job's input files
50           An error occurred during job execution
60           An error occurred exporting the job's output
70           An unspecified I/O error occurred while waiting to import/export

Error codes 40 and above occur while the job is executing, and the job will exit with that status as well.  For example, if a PFTP I/O error occurs while importing or exporting files, the job will set the corresponding code from the table above and will exit with that status code as well.

SSH To Job

Running CRS jobs can be ssh'd into using Condor's "ssh-to-job" capability.  All that is required is to run "crs_ssh <jobid>" and CRS will try to ssh into the running job.  You are put on the machine where the job is running as the *reco user.  As of now, you are left in the root directory -- but the job's current scratch directory is available in the $_CONDOR_SCRATCH_DIR environment variable, so you can just "cd" into that directory and poke around.
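
For example, with a hypothetical job ID:

crs_ssh 1201                  # ssh into the node running job 1201
cd $_CONDOR_SCRATCH_DIR       # move into the job's scratch directory
ls -l                         # look at the job's working files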

If the job finishes while you are ssh'd into that machine, you will be kicked out within a couple of minutes.  The reason is that, in order to keep jobs consistent between CRS and Condor, the CRS submit daemon scans for jobs that it thinks are done and removes them from Condor if necessary.

 

Sample Jobs

# The main section contains declarations of the number of inputs and outputs
# the job expects to make, along with an optional queue selection (which
# defaults to a queue named 'default' that is always in the database).  

[main]
queue = low

num_inputs = 3
num_outputs = 3

# This is how many HPSS cache-misses are allowable (cycles that the program
# would go through STAGE->RUN->RETRY->STAGE again) before it fails altogether
max_tries = 2

# There are also two optional callbacks to be executed when the job either
# completes or fails with an error.  They will be interpreted by a shell
# (pipes, etc are OK).  These callbacks can be executed anywhere the job enters
# an error condition, either during staging (on the submit node) or running (on
# the farm node), so please specify an executable that is accessible anywhere
# (network filesystem or in the image)
on_error = /path/to/executable -arg1 arg2
on_complete = echo "Done" | nc somehost 2222

# Jobs can be given a name, which must be alphanumeric plus underscores or dashes only,
# and must be unique -- jobs cannot have conflicting names in the database.
# You will be able to refer to these jobs via the name provided here in the
# command line wherever a <jobspec> field is specified in the documentation.
name = something_alphanumeric_here

# Jobs with this set to "True"/"Yes"/"1"/"on" will be removed from the database
# automatically within 2 hours of them entering the DONE state
auto_remove = true

# Notice how with more than one [exec] section you need to specify the
# pertinent inputs and outputs for each executable.  Notice also that you can
# repeat inputs for multiple executables, and that you must cover all files
# defined below with the set of input/output files declared in all executables 
# or you will get an error.  If the gzip_output flag is present and set to True the
# standard output/error streams will be compressed before writing them to their
# final locations.

[exec-0]
exec = /afs/rhic.bnl.gov/star/something-here/root_caller.csh
args = -t 32 -D -s '/star/u/' -p
inputs = 0,1
gzip_output = True
stdout = /phenix/zdata99/whatever/this.txt
stderr = /tmp/joblog/that.txt
outputs = 1

[exec-1]
exec = /afs/rhic.bnl.gov/star/something-here/root_caller_again.csh
args = -t 32 -D -s '/star/u/'
inputs = 0,2
outputs = 0

[exec-2]
exec = /afs/rhic.bnl.gov/star/something-here/wrap_up
args = -v -q 23
env = CONFIG_DIR=/home/starreco/cfg;EXAMPLE_ENV=whatever
outputs = 2

# ************ Input files ************

[input-0]
type = HPSS
path = /home/starsink/raw/daq/2010/035/11035070/
file = st_upc_11035070_raw_3280001.daq

[input-2]
type = HPSS
path = /home/starsink/raw/daq/2010/035/11035219/
file = st_upc_11411579_raw_5341001.daq

[input-1]
type = HPSS
path = /home/starsink/raw/daq/2010/033/11033073/
file = st_physics_11033073_raw_3030002.daq

# ************** Outputs *******************
[output-0]
type = HPSS
path = /home/starreco/reco/AuAu2010_production/ReversedFullField/P10ij/2010/035/
file = st_upc_1242424234

[output-1]
type = DCACHE
path = /pnfs/rcf.bnl.gov/star/starreco/scratch//run10auau_zerof_pro85/what/
file = DST_MPC_UP_run10auau_zerof-2432342341121

[output-2]
type = LOCAL
path = /phenix/zdata01/phnxreco/run10auau_zerof_pro85/2342_DST_242323/
file = testfile