By William Streck… | Mon, 05/03/2021 - 13:26


Submit Machine Software

The CRS software consists of a pair of daemons that run as the reco-users on dedicated submit machines (currently rcrsuser1 for PHENIX and rcrsuser3 for STAR) in conjunction with the Condor batch software. One daemon is simply a logserver that writes log messages from the running jobs to a folder in the reco-user's (local) home directory. The other is the submit-daemon, which does most of the work in CRS: it receives incoming jobs, submits and manages their stage requests, and then, once the stage requests have completed, submits and manages the jobs in Condor.

The user should never have to access jobs directly in Condor, because CRS monitors the batch system and manages the state of its jobs. If a job goes missing in Condor, CRS will mark it as an error; likewise for jobs that are held. Such jobs can be managed with the regular CRS tools. If you kill a job via crs_kill, the Condor job is removed for you, and if you accidentally delete a running job, CRS checks the consistency between its database and Condor and will start or kill Condor jobs as needed to make the two consistent.
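The consistency check described above can be pictured with a minimal Python sketch. This is not the actual CRS code: the function name, the action labels, the "RUNNING" state name, and the Condor status strings are all assumptions made for illustration. It only shows the general idea of comparing two views of the same jobs and deriving corrective actions.

```python
# Hypothetical sketch of a CRS/Condor consistency pass (not the real CRS code).
# "SUBMITTED" appears in this document; "RUNNING" is an assumed state name.
ACTIVE_STATES = {"SUBMITTED", "RUNNING"}


def reconcile(crs_jobs, condor_jobs):
    """Return a list of (action, job_id) pairs.

    crs_jobs:    dict mapping job id -> state recorded in the CRS database
    condor_jobs: dict mapping job id -> status reported by the batch system
                 (assumed here to be "idle", "running", or "held")
    """
    actions = []

    # Jobs CRS believes are active but that are missing or held in Condor
    # get marked as errors.
    for job_id, state in sorted(crs_jobs.items()):
        if state not in ACTIVE_STATES:
            continue
        status = condor_jobs.get(job_id)
        if status is None or status == "held":
            actions.append(("mark_error", job_id))

    # Jobs present in Condor that CRS does not consider active get removed.
    for job_id in sorted(condor_jobs):
        if crs_jobs.get(job_id) not in ACTIVE_STATES:
            actions.append(("remove_from_condor", job_id))

    return actions
```

For example, a job that CRS records as active but that no longer appears in the batch system yields a `mark_error` action, after which it can be handled with the regular CRS tools.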

Users control these daemons via the "crsctl" command.  The daemons write logs to ~/crs/logs/, and if for some reason they crash, this is the first place to look.  Currently, the submit-daemon will crash if the HPSS database goes down, but it simply needs to be restarted via "crsctl submitd start".  This limitation will change soon.  Even if the daemons crash, they recover the state of the jobs they were managing at the time of the crash once they are restarted.  It is also possible to stop and start them in the middle of a run; they will recover all outstanding stage requests and queued jobs.

Error Handling

Jobs that go wrong will usually end up in the ERROR state, and the job logs are the place to look for the problem.  Most operations (like PFTP) will retry several times before giving up, so a failure may not be immediately visible as an ERROR.
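To illustrate why a failing operation does not show up as an error right away, here is a minimal Python sketch of a retry loop. It is an assumption-laden illustration, not the CRS job wrapper: the function name, the default of three attempts, and the delay are hypothetical, and the operation is modelled as any callable returning a process-style exit code.

```python
import time


def run_with_retries(operation, max_attempts=3, delay_seconds=0.0, log=print):
    """Call `operation` until it returns exit code 0 or attempts run out.

    All names and defaults here are hypothetical; the real retry count and
    delay used by CRS are not stated in this document.
    Returns True on success, False if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code = operation()
        log(f"attempt {attempt}/{max_attempts}: exit code {exit_code}")
        if exit_code == 0:
            return True
        # Pause before the next try, but not after the final failure.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False
```

Only when the function returns False would the job be expected to move to the ERROR state, which is why the per-attempt log lines are usually the quickest way to see what went wrong.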

As mentioned above, the first place to debug submit-daemon problems is ~/crs/logs.  For jobs running on the farm, the job wrapper script sends its log messages to the logserver on the rcrsuser machines, which writes them to ~/job-logs/.  These logs, along with the actual Condor logs (under ~/job-logs/condor/), are cleaned up automatically after a few days to prevent the disk from filling up.  The logs contain a wealth of debug and timing information from the job.  If there are PFTP or DCCP problems, the stdout, stderr, and exit code of those processes are logged here.

Subsystem Blocking Flags

There is a table of I/O subsystems that can be blocked on a per-experiment basis.  For example, if there is a known problem with PFTP, the user or admin can run "crs_service_block -set pftp off" to turn the PFTP flag off.  Every job, before importing or exporting files to HPSS, checks first the HPSS flag and then the PFTP flag; if either is off, the job enters a waiting pattern.  Waiting jobs poll the flag every few minutes and loop until it is set on again.  Only then do they continue.  For example, if PHENIX sets the DCACHE flag off, jobs will stage, import, and run normally, but when it comes time to export, the files going to DCACHE will not move until the flag is set on again.  Likewise, if STAR sets the PFTP flag off, all jobs will wait before importing or exporting any files.
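The waiting pattern can be sketched in Python as follows. This is only an illustration under stated assumptions: `read_flag` is a hypothetical callable standing in for whatever lookup the jobs perform against the flag table, the subsystem names are lower-cased by assumption, and the 180-second interval is a stand-in for "every few minutes".

```python
import time


def wait_until_unblocked(read_flag, subsystems=("hpss", "pftp"),
                         poll_seconds=180.0, sleep=time.sleep):
    """Block until every listed subsystem flag is on; return the poll count.

    read_flag(name) is a hypothetical callable returning True when the flag
    for `name` is on (subsystem usable) and False when it is off (blocked).
    Flags are checked in order, HPSS first and then PFTP, as described above.
    """
    polls = 0
    for name in subsystems:
        # Loop on this flag until it is switched on again.
        while not read_flag(name):
            polls += 1
            sleep(poll_seconds)
    return polls
```

A job exporting to DCACHE would, in this picture, call the same function with a different `subsystems` tuple just before the export step, which is why such jobs can stage, import, and run before they start waiting. Note that this simple version does not re-check an earlier flag once it has been seen on.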

Stage requests cannot be controlled with this mechanism because CRS does not issue the stage requests directly (David Yu provides a database and a simple API we use).  If there is a problem staging files, David will stop updating the table we read from, and therefore no new jobs will move from STAGING to SUBMITTED.  If users wish to cancel stage requests, they can kill, reset, or hold the jobs that are QUEUED or STAGING as appropriate, then resubmit them at a later time.