CRS Job Description Files
CRS jobs are defined by a job description file (JDF), which is submitted to the CRS submit daemon. A JDF is a standard .ini-style config file that looks like this:
# A comment starts with a pound-sign
[section-name]
variable = value
something = else
There are a number of required sections and variables, and some optional ones. You first define a section called [main], where you say how many inputs and outputs you have and set a number of optional variables. Then you define one or more executable sections, one per executable you wish to run, optionally listing which inputs and outputs belong to each. Finally, you define the input and output files and their types.
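Because a JDF is plain .ini syntax, a generic parser can read it. The sketch below uses Python's standard configparser module on a hypothetical minimal JDF; the file contents, the helper name load_jdf, and the choice of Python are illustrative assumptions, not part of CRS itself. The key names follow the list of required arguments in this document.

```python
import configparser

# Hypothetical minimal JDF text, following the .ini layout described above.
JDF_TEXT = """
# A comment starts with a pound-sign
[main]
num_inputs = 1
num_outputs = 1

[exec-0]
exec = /path/to/executable

[input-0]
type = HPSS
path = /tmp/folder
name = file_1.txt

[output-0]
type = LOCAL
path = /tmp/out
name = file_1.out
"""

def load_jdf(text):
    """Parse JDF text; interpolation is disabled so '%' in values is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser

jdf = load_jdf(JDF_TEXT)
section_names = jdf.sections()                 # sections in file order
num_inputs = jdf.getint("main", "num_inputs")  # integer value from [main]
print(section_names, num_inputs)
```

This only shows how the sections and variables map onto a parsed structure; how CRS itself reads the file is not specified here.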
Required Sections and Arguments
- [main]: Must exist and contain the following
  - num_inputs / num_outputs = <int>
- [exec-0]
  - exec = <full path of executable>
- [input-0], [output-0]
  - type = <file type, HPSS, DCACHE, etc...>
  - path = <path up to the folder containing file, e.g. "/tmp/folder">
  - name = <filename, in folder specified in path, e.g. "file_1.txt">
The input, output, and exec sections can each have multiple instances, named like [input-n], where n is an integer starting at 0 that numbers the section.
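The required sections can be checked before a file is handed to CRS. The following is a minimal sketch, not the validation CRS performs: check_required is a hypothetical helper, and it assumes that [input-n] and [output-n] are numbered 0 to count-1 without gaps. The set of per-file keys is a parameter because the example job file in this document uses "file" where the list above says "name".

```python
import configparser

def check_required(parser, file_keys=("type", "path", "name")):
    """Return a list of problems; an empty list means the checks passed."""
    if not parser.has_section("main"):
        return ["missing [main] section"]
    problems = []
    counts = {}
    for key in ("num_inputs", "num_outputs"):
        try:
            counts[key] = parser.getint("main", key)
        except (configparser.NoOptionError, ValueError):
            problems.append(f"[main] needs an integer {key}")
    if not parser.has_option("exec-0", "exec"):
        problems.append("[exec-0] needs exec")
    for kind in ("input", "output"):
        # Assumption: sections are numbered 0 .. count-1 with no gaps.
        for n in range(counts.get(f"num_{kind}s", 0)):
            section = f"{kind}-{n}"
            for key in file_keys:
                if not parser.has_option(section, key):
                    problems.append(f"[{section}] needs {key}")
    return problems

# Hypothetical JDF that omits the path of its only output.
SAMPLE = """
[main]
num_inputs = 1
num_outputs = 1
[exec-0]
exec = /path/to/executable
[input-0]
type = HPSS
path = /tmp/folder
name = file_1.txt
[output-0]
type = LOCAL
name = file_1.out
"""

sample_parser = configparser.ConfigParser(interpolation=None)
sample_parser.read_string(SAMPLE)
sample_problems = check_required(sample_parser)
print(sample_problems)
```

Passing file_keys=("type", "path", "file") would apply the same check using the key name from the example job file.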
Optional Arguments
- [main]
  - queue = <defined queue: see here for details>
  - max_tries = <int> Number of times the job will retry on a stage-miss before giving up; defaults to 2.
  - on_error / on_complete = <command> Command that runs in a shell when the job enters the ERROR or DONE state, respectively. This can happen on the submit or execute node, so the executable should be available in NFS/AFS from everywhere the job may be. The job blocks until the command is done, so please keep callbacks short and error-free.
  - name = <string> A name for the job, which must be unique across all jobs currently in CRS.
  - auto_remove = <bool> (true/false, yes/no, or 1/0) Determines whether the job is automatically deleted from the CRS database on successful completion.
- [exec-n]
  - args = <string> The arguments to pass to the executable on the command line. They are not passed through a shell, so no escaping is needed.
  - env = <string> A string formatted as "<KEY>=<VALUE>;<KEY2>=<VALUE2>", i.e. a semicolon-separated list of key/value pairs, each joined by an equals sign. These environment variables are passed to the executable when it runs, in addition to those in the condor job and those provided by CRS.
  - inputs / outputs = <int[,int,...]> Necessary only if there are multiple [exec-n] sections. An integer or comma-separated list of integers naming which input / output files are used by this executable.
  - stdout / stderr = <string> The full path where the job's standard output/error stream is written when it completes; must be locally accessible (NFS, etc.).
  - gzip_output = <bool> (true/false, 1/0, etc.; default false) Compress the job's stdout/stderr streams as they are written to their final destination.
See the example job file below for details.
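The value formats described above (booleans, the semicolon-separated env string, and the comma-separated inputs/outputs lists) can be illustrated with a few small parsers. This is a sketch of the formats as documented, not CRS's own parsing code; the function names are hypothetical, and accepting "off" as a false value is an assumption made by symmetry with "on" from the example job file.

```python
def parse_bool(value):
    """Interpret a JDF boolean such as true/false, yes/no, or 1/0."""
    text = value.strip().lower()
    if text in {"true", "yes", "1", "on"}:
        return True
    if text in {"false", "no", "0", "off"}:  # "off" is an assumption
        return False
    raise ValueError(f"not a boolean: {value!r}")

def parse_env(value):
    """Split 'KEY=VALUE;KEY2=VALUE2' into a dict, splitting each pair on its first '='."""
    env = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, sep, val = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed pair: {pair!r}")
        env[key.strip()] = val
    return env

def parse_index_list(value):
    """Turn '0,1' (or a single '2') into a list of integers."""
    return [int(item) for item in value.split(",") if item.strip()]

env_example = parse_env("CONFIG_DIR=/home/starreco/cfg;EXAMPLE_ENV=whatever")
index_example = parse_index_list("0,1")
print(env_example, index_example, parse_bool("Yes"))
```

Note that with this format a value cannot itself contain a semicolon, since the semicolon is the pair separator.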
CRS jobs can be created by running "crs_job -create <file(s)>" or "crs_create <file(s)>", which take one or more JDFs as arguments and insert the jobs into the database, returning the unique ID for each job. For jobs to be considered for running, you must then submit them via "crs_job -submit <jobid(s)>" or "crs_submit <jobid(s)>". This allows you to create many jobs ahead of time and submit them in batches. The command "crs_insert <file(s)>" is a shortcut that performs both steps at once: it takes a job file, creates the job, and submits it to run.
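Submitting in batches can be scripted. The sketch below only builds the argument lists for "crs_submit" and does not run anything; the helper name submit_commands, the batch size, and the job IDs are hypothetical, and it assumes job IDs are passed as separate command-line arguments, as "<jobid(s)>" suggests.

```python
def submit_commands(job_ids, batch_size):
    """Build one crs_submit argument list per batch of job IDs (nothing is executed)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = [str(job_id) for job_id in job_ids]
    return [
        ["crs_submit"] + ids[start:start + batch_size]
        for start in range(0, len(ids), batch_size)
    ]

# Hypothetical job IDs, as might be returned when the jobs were created.
commands = submit_commands([101, 102, 103], batch_size=2)
for command in commands:
    print(" ".join(command))
```

Each list could then be passed to subprocess.run on a machine where the CRS commands are installed.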
CRS jobs can have environment variables from three sources. First, when the job is submitted to Condor, it reads from a template that can specify environment variables, although you will not need to touch this unless there are variables common to all your jobs. Next, each executable can specify its own environment variables. Lastly, CRS provides some of its own environment variables to each executable. These are listed below:
|INPUT<N>||Filename of Nth input for that executable, no directory|
|OUTPUT<N>||Filename of Nth output for that executable, no directory|
|ACTUAL_INPUT<N>||Full path of Nth input, as specified in jdf|
|ACTUAL_OUTPUT<N>||Full path of Nth output, as specified in jdf|
|STD_OUT||Local filename of exec's standard output|
|STD_ERR||Local filename of exec's standard error|
|CRS_STDOUT_DIR||Full path to directory where exec's standard out will go|
|CRS_STDERR_DIR||Full path to directory where exec's standard error will go|
|CRS_JOB_NAME||If provided in JDF, the job name, otherwise meaningless|
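An executable can pick these variables out of its environment. The following sketch collects the numbered variables from any mapping; numbered_vars is a hypothetical helper, the sample values are invented, and no assumption is made about whether numbering starts at 0 or 1.

```python
import re

def numbered_vars(environ, prefix):
    """Collect <prefix><N> entries from a mapping, keyed by the integer N."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    found = {}
    for key, value in environ.items():
        match = pattern.match(key)
        if match:
            found[int(match.group(1))] = value
    return dict(sorted(found.items()))

# Invented stand-in for the environment a CRS executable might see.
fake_environ = {
    "INPUT0": "a.daq",
    "INPUT1": "b.daq",
    "ACTUAL_INPUT0": "/some/dir/a.daq",
    "OUTPUT0": "out.root",
    "CRS_JOB_NAME": "demo",
}

inputs = numbered_vars(fake_environ, "INPUT")
actual_inputs = numbered_vars(fake_environ, "ACTUAL_INPUT")
print(inputs, actual_inputs)
```

Inside a real job, os.environ would be passed in place of fake_environ. The anchored pattern keeps INPUT<N> and ACTUAL_INPUT<N> apart.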
Exit Codes, Error Codes, and Callbacks
When the CRS job is running on a farm node and enters the ERROR or DONE state, it will execute a callback if it is configured to do so. The callback is simply called within a shell underneath the main crs-exec process. This means callbacks should be robust and short: they should not do much or block for long, because the job will hang until the command is done.
The job will set the following error codes in the corresponding scenarios.
|10||The job is gone from the condor queue|
|20||The job encountered an error pre-staging one or more of its input files|
|30||The job retried too many times and is giving up|
|40||An error associated with importing the job's input files|
|50||Error related to job-execution|
|60||An error occurred exporting the job's output|
|70||Unspecified I/O error occurred while waiting to import/export|
Error codes 40 and above occur while the job is executing, and the job will exit with that status as well. For example, if an unspecified I/O error occurs with PFTP, the job will set the code to 70 and will also exit with status code 70.
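For scripts that post-process job results, the error-code table can be kept as a lookup. This is a sketch built only from the table and the "40 and above" rule stated here; the names CRS_ERROR_CODES, describe, and is_exit_status are hypothetical.

```python
# Error codes and meanings, copied from the table above.
CRS_ERROR_CODES = {
    10: "The job is gone from the condor queue",
    20: "Error pre-staging one or more input files",
    30: "The job retried too many times and is giving up",
    40: "Error importing the job's input files",
    50: "Error related to job execution",
    60: "Error exporting the job's output",
    70: "Unspecified I/O error while waiting to import/export",
}

def describe(code):
    """Return the documented meaning of an error code, or a fallback string."""
    return CRS_ERROR_CODES.get(code, "not a code listed in the table")

def is_exit_status(code):
    """True for listed codes of 40 and above, which the job also exits with."""
    return code in CRS_ERROR_CODES and code >= 40

for code in (30, 50):
    print(code, describe(code), is_exit_status(code))
```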
SSH To Job
Running CRS jobs can be ssh'd into using Condor's "ssh-to-job" capability. All that is required is to run "crs_ssh <jobid>" and CRS will try to ssh into the running job. You are put on the machine where the job is running as the *reco user. As of now, you are left in the root directory -- but the job's current scratch directory is available in the $_CONDOR_SCRATCH_DIR environment variable, so you can just "cd" into that directory and poke around.
If the job finishes while you are ssh'd into that machine you will be kicked out in a couple of minutes. The reason is that in order to keep jobs consistent between CRS and condor, the CRS submit-daemon scans for jobs that it thinks are done and removes them from condor if necessary.
# The main section contains declarations of the number of inputs and outputs
# the job expects to make, along with an optional queue selection (which
# defaults to a queue named 'default' that is always in the database).
[main]
queue = low
num_inputs = 3
num_outputs = 3

# This is how many HPSS cache-misses are allowable (cycles that the program
# would go through STAGE->RUN->RETRY->STAGE again) before it fails altogether
max_tries = 2

# There are also two optional callbacks to be executed when the job either
# completes or fails with an error. They will be interpreted by a shell
# (pipes, etc are OK). These callbacks can be executed anywhere the job enters
# an error condition, either during staging (on the submit node) or running (on
# the farm node), so please specify an executable that is accessible anywhere
# (network filesystem or in the image)
on_error = /path/to/executable -arg1 arg2
on_complete = echo "Done" | nc somehost 2222

# Jobs can be given a name, must be alphanumeric plus underscore or dash only,
# and must be unique -- jobs cannot have conflicting names in the database.
# You will be able to refer to these jobs via the name provided here in the
# command line wherever a <jobspec> field is specified in the documentation.
name = something_alphanumeric_here

# Jobs with this set to "True"/"Yes"/"1"/"on" will be removed from the database
# automatically within 2 hours of them entering the DONE state
auto_remove = true

# Notice how with more than one [exec] section you need to specify the
# pertinent inputs and outputs for each executable. Notice also that you can
# repeat inputs for multiple executables, and that you must cover all files
# defined below with the set of input/output files declared in all executables
# or you will get an error. If gzip_output flag is present and set to True the
# standard output/error streams will be compressed before writing them to their
# final locations.
[exec-0]
exec = /afs/rhic.bnl.gov/star/something-here/root_caller.csh
args = -t 32 -D -s '/star/u/' -p
inputs = 0,1
gzip_output = True
stdout = /phenix/zdata99/whatever/this.txt
stderr = /tmp/joblog/that.txt
outputs = 1

[exec-1]
exec = /afs/rhic.bnl.gov/star/something-here/root_caller_again.csh
args = -t 32 -D -s '/star/u/'
inputs = 0,2
outputs = 0

[exec-2]
exec = /afs/rhic.bnl.gov/star/something-here/wrap_up
args = -v -q 23
env = CONFIG_DIR=/home/starreco/cfg;EXAMPLE_ENV=whatever
outputs = 2

# ************ Input files ************
[input-0]
type = HPSS
path = /home/starsink/raw/daq/2010/035/11035070/
file = st_upc_11035070_raw_3280001.daq

[input-2]
type = HPSS
path = /home/starsink/raw/daq/2010/035/11035219/
file = st_upc_11411579_raw_5341001.daq

[input-1]
type = HPSS
path = /home/starsink/raw/daq/2010/033/11033073/
file = st_physics_11033073_raw_3030002.daq

# ************** Outputs *******************
[output-0]
type = HPSS
path = /home/starreco/reco/AuAu2010_production/ReversedFullField/P10ij/2010/035/
file = st_upc_1242424234

[output-1]
type = DCACHE
path = /pnfs/rcf.bnl.gov/star/starreco/scratch//run10auau_zerof_pro85/what/
file = DST_MPC_UP_run10auau_zerof-2432342341121

[output-2]
type = LOCAL
path = /phenix/zdata01/phnxreco/run10auau_zerof_pro85/2342_DST_242323/
file = testfile
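The coverage rule stated in the comments of the example (every declared input and output must be used by at least one executable) can be checked ahead of time. The sketch below is one way to do that with Python's configparser, not the check CRS runs; uncovered_files is a hypothetical helper, the JDF text is a trimmed stand-in for the example, and it assumes that a single [exec-n] section without inputs/outputs lists uses every file.

```python
import configparser

def uncovered_files(parser):
    """Return the declared input/output indices that no [exec-n] section lists."""
    exec_sections = [s for s in parser.sections() if s.startswith("exec-")]
    missing = {}
    for kind in ("inputs", "outputs"):
        declared = set(range(parser.getint("main", f"num_{kind}")))
        used = set()
        for section in exec_sections:
            if parser.has_option(section, kind):
                items = parser.get(section, kind).split(",")
                used.update(int(item) for item in items if item.strip())
            elif len(exec_sections) == 1:
                # Assumption: a lone executable implicitly uses every file.
                used = set(declared)
        missing[kind] = sorted(declared - used)
    return missing

# Trimmed stand-in for the example job file (file sections omitted).
JDF_TEXT = """
[main]
num_inputs = 3
num_outputs = 3
[exec-0]
exec = /path/to/first
inputs = 0,1
outputs = 1
[exec-1]
exec = /path/to/second
inputs = 0,2
outputs = 0
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(JDF_TEXT)
coverage = uncovered_files(parser)
print(coverage)
```

Here output 2 is reported as uncovered because the stand-in leaves out the third executable that writes it.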