From: Betty McBreen (mcbreen@sgs1.hirg.bnl.gov)
Date: Tue Feb 11 2003 - 08:55:00 EST
----- Original Message -----
From: "Tony W. Chan" <tony@voyager.rhic.bnl.gov>
To: <mcbreen@bnl.gov>; <ejkim@bnl.gov>; <chujo@bnl.gov>; <momchil@bnl.gov>; <burt@bnl.gov>; <belt@bnl.gov>; <didenko@bnl.gov>; <jeromel@bnl.gov>
Cc: <throwe@bnl.gov>; <tomw@bnl.gov>
Sent: Monday, February 10, 2003 5:27 PM
Subject: CRS job notification

> Hi,                                  Feb. 10, 2003
>
> After lots of debugging, I put into production
> an enhanced CRS job notification protocol. There
> are now 4 new categories of job failures that
> will notify whoever has been designated in the
> JDF file (i.e., bramreco, phnxreco, phobreco,
> starreco). In principle, it will go to either
> rcrsuser1 or rcrsuser2 (when available). The
> default (if database is missing that info)
> is rcrsuser1. If you get a message with one
> of the following key words, it means the
> faulty job has been cleaned up, and you should
> resubmit it.
>
> This type of notification will only occur twice
> a day per experiment, since I use a cron job to
> look for these discrepancies. I limit it to twice
> a day to avoid overloading the database with
> requests.
>
> Please pass this message to the person(s) submitting
> jobs for your respective experiment to notify them
> of this update. Thanks.
>
> Cheers,
>
> Tony
>
> *****************************************************
>
> 1) transfer failed --> job in database, but not in
>                        the assigned node. One or
>                        more of the input and/or
>                        JDF file(s) did not transfer
>                        successfully to CRS node.
>
> 2) msg. failed     --> job status says "executing",
>                        but "ps" shows job not running.
>                        Assume failed to "transfer out"
>                        due to message passing failure.
>
> 3) db failed (1)   --> job in node (and # of "running"
>                        jobs in db > 0), but partial db
>                        info is missing. Assume db update
>                        failed.
>
> 4) db failed (2)   --> job in node (and # of "running"
>                        jobs in db = 0), but partial db
>                        info is missing. Assume db update
>                        failed.
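
For anyone scripting around these notifications, the four categories above can be read as a small decision table. The following Python sketch is only an illustration of that reading and is not the actual CRS code: the `JobObservation` fields, the `classify` and `recipient` function names, and the order in which the checks are applied are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobObservation:
    """Hypothetical snapshot of one job, as a periodic check might see it."""
    in_database: bool        # job has an entry in the database
    on_node: bool            # job is present on its assigned CRS node
    db_status: str           # status recorded in the database, e.g. "executing"
    process_running: bool    # whether "ps" on the node shows the job
    running_jobs_in_db: int  # number of "running" jobs recorded in the db
    db_info_complete: bool   # False if partial db info is missing


def classify(obs: JobObservation) -> Optional[str]:
    """Return the notification key word for a discrepancy, or None.

    The order of the checks is an assumption; the message does not say
    how overlapping conditions are resolved.
    """
    if obs.in_database and not obs.on_node:
        return "transfer failed"
    if obs.on_node and not obs.db_info_complete:
        # Categories 3 and 4 differ only in the "running" job count.
        return "db failed (1)" if obs.running_jobs_in_db > 0 else "db failed (2)"
    if obs.db_status == "executing" and not obs.process_running:
        return "msg. failed"
    return None


def recipient(designated: Optional[str]) -> str:
    """Fall back to rcrsuser1 when the database has no designated user."""
    return designated if designated else "rcrsuser1"
```

For example, a job that is in the database but absent from its node would be classified as "transfer failed", and per the message it would then need to be resubmitted.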
This archive was generated by hypermail 2.1.5 : Tue Feb 11 2003 - 08:56:02 EST