Re: Shift report 20000817 00:00-08:00

From: Konstantin Olchanski (olchansk@ux1.phy.bnl.gov)
Date: Thu Aug 17 2000 - 13:11:45 EDT


    > ------------------------------------------------------------
    > Date: Thu Aug 17 08:01:27 2000, Shift: 00:00-08:00
    > ------------------------------------------------------------
    >    Problem with the DAQ: At 04:20 we discovered that 
    >    the .dat file for a sequence was deleted from the 
    >    spool dir when a new sequence started. The DAQ 
    >    window gave no error messages. Only the .snd files 
    >    (it seems) are saved.
    > ------------------------------------------------------------
    
    
    The DAQ problem was caused by a combination of two bugs in the
    two scripts that move data from DAQ spool on opus to HPSS and
    delete old files.
    
    The result was that data taken from around midnight until around 9:00,
    about 50 sequence files, was lost.
    
    The fix for one bug was trivial, and the fix for the second bug
    is being tested as I write this message.
    
    For the curious, this is what happened:
    
    The HPSS transfer program "pftp" hung at "Mon Aug 14 12:24:41 EDT 2000"
    in a place not protected by an automatic timeout. A few other places
    where "pftp" can hang are already protected. If "pftp" hangs, the script
    is automatically restarted, and this protection has already recovered
    from a number of "pftp" hangs.
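    For illustration, the kind of automatic-timeout protection described
    above can be sketched in shell. The function name and the deadlines
    are invented for this example; the real scripts' protection code is
    not shown here.

```shell
# Sketch of a hang guard: run a command in the background and kill it
# if it exceeds a deadline, so the calling script can restart the
# transfer instead of blocking forever on a hung "pftp".
run_with_deadline() {
    deadline=$1; shift
    "$@" &                                       # start the command
    cmd_pid=$!
    ( sleep "$deadline"; kill "$cmd_pid" 2>/dev/null ) &
    watchdog_pid=$!
    wait "$cmd_pid"                              # non-zero if the command was killed
    status=$?
    kill "$watchdog_pid" 2>/dev/null             # command finished; stop the watchdog
    return "$status"
}

run_with_deadline 1 sleep 30                     # simulated hang, killed after 1 s
echo "hung command exit status: $?"
```

    A supervising script would treat a non-zero status as "pftp hung or
    failed" and restart the transfer.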
    
    After "pftp" hung, and until I manually restarted the "hpssSend" script
    (through "DAQ control" -> "Tasks" -> "hpssSend" -> "restart"),
    no data was moved to HPSS and the opus spool disk filled up.
    This could be seen on the "DAQ Spool status" web page, off the main DAQ
    page on pii3. Once I restarted the "hpssSend" script, all the accumulated
    data promptly and happily marched off to HPSS.
    
    Normally, when HPSS transfers stop or hang, the spool disk fills up
    and the event builder refuses to write data to spool and HPSS. An
    attempt to start a run and write data to spool/HPSS would result
    in a "Run error" and the run would stop.
    
    This protection against data loss did not work this time. There was
    neither an error message nor an alarm for the operators. This is why:
    
    Once the disk filled up, the bug in the disk cleaning program started doing
    the real damage. The program would not delete any files already queued
    for transfer to HPSS, but because of a design error, it *would*
    delete the sequence files ***while they were being written by the event builder***,
    because they were not yet marked as "queued for hpss transfer". So
    the event builder would open a new sequence file, start writing into it,
    and a few minutes later, even before the event builder would close the file,
    the disk cleaning program would come along and delete it. Quite a nasty error.
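    The flawed selection rule can be sketched like this (the spool
    directory layout and the ".queued" marker-file name are invented
    for this example; the real program's marking mechanism is not
    shown):

```shell
# Sketch of the buggy rule: any ".dat" file not yet marked as queued
# for HPSS transfer was considered deletable -- including the file
# the event builder was still writing.
spool=$(mktemp -d)
touch "$spool/seq001.dat" "$spool/seq001.queued"   # safely queued for HPSS
touch "$spool/seq002.dat"                          # event builder still writing it!

for dat in "$spool"/*.dat; do
    [ -f "${dat%.dat}.queued" ] && continue        # spare the queued files...
    rm -f "$dat"                                   # ...but delete the open one
done
```

    After the loop, seq002.dat is gone even though it was still being
    written, which is exactly the data loss seen on shift.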
    
    This design error was fixed by changing the algorithm used to select
    and mark files for deletion. The corrected cleaning program only
    deletes ".dat" files that have a corresponding "delete-me" file
    with a ".del" suffix. These "delete-me" files are created by the
    spool manager only after the data file has been moved to HPSS.
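    The corrected rule can be sketched as follows (the spool directory
    and sequence-file names are invented for illustration; the ".del"
    suffix is the real convention described above):

```shell
# Sketch of the corrected rule: a ".dat" file becomes deletable only
# after the spool manager has written the matching ".del" marker,
# which it does only once the file is safely on HPSS.
spool=$(mktemp -d)
touch "$spool/seq001.dat" "$spool/seq001.del"   # on HPSS, marked for deletion
touch "$spool/seq002.dat"                       # no ".del" marker: leave it alone

for dat in "$spool"/*.dat; do
    del="${dat%.dat}.del"
    [ -f "$del" ] || continue        # no marker, file stays on disk
    rm -f "$dat" "$del"              # marker present: safe to clean up
done
```

    A file the event builder is still writing can never have a ".del"
    marker, so the race that deleted open sequence files is gone.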
    
    While I am sure that we will see more HPSS hangs, hopefully they will
    no longer cause us to lose already collected data...
    
    
    -- 
    Konstantin Olchanski
    Physics Department, Brookhaven National Laboratory, Long Island, New York
    olchansk@bnl.gov
    



    This archive was generated by hypermail 2b29 : Thu Aug 17 2000 - 13:15:03 EDT