Fw: CRS job notification

From: Betty McBreen (mcbreen@sgs1.hirg.bnl.gov)
Date: Tue Feb 11 2003 - 08:55:00 EST

    ----- Original Message -----
    From: "Tony W. Chan" <tony@voyager.rhic.bnl.gov>
    To: <mcbreen@bnl.gov>; <ejkim@bnl.gov>; <chujo@bnl.gov>; <momchil@bnl.gov>;
    <burt@bnl.gov>; <belt@bnl.gov>; <didenko@bnl.gov>; <jeromel@bnl.gov>
    Cc: <throwe@bnl.gov>; <tomw@bnl.gov>
    Sent: Monday, February 10, 2003 5:27 PM
    Subject: CRS job notification
    
    
    > Hi,                            Feb. 10, 2003
    >
    > After lots of debugging, I put into production
    > an enhanced CRS job notification protocol. There
    > are now 4 new categories of job failures that
    > will notify whoever has been designated in the
    > JDF file (i.e., bramreco, phnxreco, phobreco,
    > starreco). In principle, it will go to either
    > rcrsuser1 or rcrsuser2 (when available). The
    > default (if database is missing that info)
    > is rcrsuser1. If you get a message with one
    > of the following key words, it means the
    > faulty job has been cleaned up, and you should
    > resubmit it.
    >
    > This type of notification will only occur twice
    > a day per experiment, since I use a cron job to
    > look for these discrepancies. I limit it to twice
    > a day to avoid overloading the database with
    > requests.
    >
    > Please pass this message to the person(s) submitting
    > jobs for your respective experiment to notify them
    > of this update. Thanks.
    >
    > Cheers,
    >
    > Tony
    >
    > *****************************************************
    >
    > 1) transfer failed --> job in database, but not in
    >                        the assigned node. One or
    >                        more of the input and/or
    >                        JDF file(s) did not transfer
    >                        successfully to CRS node
    >
    > 2) msg. failed     --> job status says "executing",
    >                        but "ps" shows job not running.
    >                        Assume failed to "transfer out"
    >                        due to message passing failure.
    >
    > 3) db failed (1)   --> job in node (and # of "running"
    >                        jobs in db > 0), but partial db
    >                        info is missing. Assume db update
    >                        failed.
    >
    > 4) db failed (2)   --> job in node (and # of "running"
    >                        jobs in db = 0), but partial db
    >                        info is missing. Assume db update
    >                        failed.
    >
    >
    >
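
For readers who want to see the four categories as decision logic, here is a minimal Python sketch. It is not Tony's actual cron script: the `JobSnapshot` fields, the `classify` and `recipient` function names, and the order in which the checks run are all assumptions made for illustration. Only the four key words, the conditions attached to them, and the rcrsuser1 default come from the message above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobSnapshot:
    """Hypothetical view of one CRS job, as a periodic check might gather it."""
    in_database: bool        # job has a record in the CRS database
    on_node: bool            # job files are present on the assigned CRS node
    db_status: Optional[str] # status string stored in the database, if any
    process_running: bool    # whether "ps" on the node shows the job
    db_info_complete: bool   # False when part of the db record is missing
    running_jobs_in_db: int  # number of "running" jobs the db reports


def classify(job: JobSnapshot) -> Optional[str]:
    """Return the notification key word for a faulty job, or None if it looks healthy.

    The check order is an assumption; the original message does not specify one.
    """
    # 1) In the database but never arrived on the node: input/JDF transfer failed.
    if job.in_database and not job.on_node:
        return "transfer failed"
    # 3) and 4) On the node but the db record is partial: assume the db update failed.
    if job.on_node and not job.db_info_complete:
        return "db failed (1)" if job.running_jobs_in_db > 0 else "db failed (2)"
    # 2) Database says "executing" but no process exists: message passing failed.
    if job.db_status == "executing" and not job.process_running:
        return "msg. failed"
    return None


def recipient(designated: Optional[str]) -> str:
    """Pick who is notified; fall back to rcrsuser1 when the database has no entry."""
    return designated or "rcrsuser1"
```

A job that `classify` flags with any of the four key words has, per the message, already been cleaned up and should be resubmitted.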
    

