Re: Shift report 20000817 00:00-08:00

From: Konstantin Olchanski (olchansk@ux1.phy.bnl.gov)
Date: Thu Aug 17 2000 - 13:11:45 EDT


    > ------------------------------------------------------------
    > Date: Thu Aug 17 08:01:27 2000, Shift: 00:00-08:00
    > ------------------------------------------------------------
    >    Problem with the DAQ: At 04:20 we discovered that 
    >    the .dat file for a sequence was deleted from the 
    >    spool dir when a new sequence started. The DAQ 
    >    window gave no error messages. Only the .snd files 
    >    (it seems) are saved.
    > ------------------------------------------------------------
    
    
    The DAQ problem was caused by a combination of two bugs in the
    two scripts that move data from DAQ spool on opus to HPSS and
    delete old files.
    
    The result was that data taken from around midnight until around 9:00,
    about 50 sequence files, was lost.
    
    The fix for one bug was trivial, and the fix for the second bug
    is being tested as I write this message.
    
    For the curious, this is what happened:
    
    The HPSS transfer program "pftp" hung at "Mon Aug 14 12:24:41 EDT 2000"
    in a place not protected by an automatic timeout. A few other places
    where "pftp" can hang are already protected. If "pftp" hangs, the script
    is automatically restarted, and this protection has already recovered
    from a number of "pftp" hangs.
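    For illustration, the kind of automatic-timeout protection described
    above can be sketched in shell. The function name and the deadlines
    are invented for this example; the real scripts' protection code is
    not shown here.

```shell
# Sketch of a hang guard: run a command in the background and kill it
# if it exceeds a deadline, so the calling script can restart the
# transfer instead of blocking forever on a hung "pftp".
run_with_deadline() {
    deadline=$1; shift
    "$@" &                                       # start the command
    cmd_pid=$!
    ( sleep "$deadline"; kill "$cmd_pid" 2>/dev/null ) &
    watchdog_pid=$!
    wait "$cmd_pid"                              # non-zero if the command was killed
    status=$?
    kill "$watchdog_pid" 2>/dev/null             # command finished; stop the watchdog
    return "$status"
}

run_with_deadline 1 sleep 30                     # simulated hang, killed after 1 s
echo "hung command exit status: $?"
```

    A supervising script would treat a non-zero status as "pftp hung or
    failed" and restart the transfer.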
    
    After "pftp" hung, and until I manually restarted the "hpssSend" script
    (through "DAQ control" -> "Tasks" -> "hpssSend" -> "restart"),
    no data was moved to HPSS and the opus spool disk filled up.
    This could be seen on the "DAQ Spool status" web page, off the main DAQ
    page on pii3. Once I restarted the "hpssSend" script, all the accumulated
    data promptly and happily marched off to HPSS.
    
    Normally, when HPSS transfers stop or hang, the spool disk fills up
    and the event builder refuses to write data to spool and HPSS. An
    attempt to start a run and write data to spool/HPSS would result
    in a "Run error" and the run would stop.
    
    This protection against data loss did not work this time. There was
    neither an error message nor an alarm for the operators. This is why:
    
    Once the disk filled up, the bug in the disk cleaning program started doing
    the real damage. The program would not delete any files already queued
    for transfer to HPSS, but because of a design error, it *would*
    delete the sequence files ***while they were being written by the event builder***,
    because they were not yet marked as "queued for hpss transfer". So
    the event builder would open a new sequence file, start writing into it,
    and a few minutes later, even before the event builder would close the file,
    the disk cleaning program would come along and delete it. Quite a nasty error.
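    The flawed selection rule can be sketched like this (the spool
    directory layout and the ".queued" marker-file name are invented
    for this example; the real program's marking mechanism is not
    shown):

```shell
# Sketch of the buggy rule: any ".dat" file not yet marked as queued
# for HPSS transfer was considered deletable -- including the file
# the event builder was still writing.
spool=$(mktemp -d)
touch "$spool/seq001.dat" "$spool/seq001.queued"   # safely queued for HPSS
touch "$spool/seq002.dat"                          # event builder still writing it!

for dat in "$spool"/*.dat; do
    [ -f "${dat%.dat}.queued" ] && continue        # spare the queued files...
    rm -f "$dat"                                   # ...but delete the open one
done
```

    After the loop, seq002.dat is gone even though it was still being
    written, which is exactly the data loss seen on shift.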
    
    This design error was fixed by changing the algorithm used to select
    and mark files for deletion. The corrected cleaning program only
    deletes ".dat" files that have a corresponding "delete-me" file
    with a ".del" suffix. These "delete-me" files are created by the
    spool manager only after the data file has been moved to HPSS.
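    The corrected rule can be sketched as follows (the spool directory
    and sequence-file names are invented for illustration; the ".del"
    suffix is the real convention described above):

```shell
# Sketch of the corrected rule: a ".dat" file becomes deletable only
# after the spool manager has written the matching ".del" marker,
# which it does only once the file is safely on HPSS.
spool=$(mktemp -d)
touch "$spool/seq001.dat" "$spool/seq001.del"   # on HPSS, marked for deletion
touch "$spool/seq002.dat"                       # no ".del" marker: leave it alone

for dat in "$spool"/*.dat; do
    del="${dat%.dat}.del"
    [ -f "$del" ] || continue        # no marker, file stays on disk
    rm -f "$dat" "$del"              # marker present: safe to clean up
done
```

    A file the event builder is still writing can never have a ".del"
    marker, so the race that deleted open sequence files is gone.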
    
    While I am sure that we will see more HPSS hangs, hopefully they will
    no longer cause us to lose already collected data...
    
    
    -- 
    Konstantin Olchanski
    Physics Department, Brookhaven National Laboratory, Long Island, New York
    olchansk@bnl.gov
    



    This archive was generated by hypermail 2b29 : Thu Aug 17 2000 - 13:15:03 EDT