disk issues -- RCF

From: Flemming Videbaek (videbaek@sgs1.hirg.bnl.gov)
Date: Mon Jul 14 2003 - 12:22:44 EDT

    FYI
    
    I am reading through my many mail messages since returning from vacation today.
    
    The following seems related to the problems Steve and others had with data07.
    There are two mails from Tom Throwe; since the second seems to explain the final outcome, I have put it first.
    Has this solved the problem to everyone's satisfaction?
    
    Flemming
    
    
    I/
    Dear Flemming,
    
    As I said in my previous message to you, the initial problem with the 
    data disks on the BRAHMS machines was that the patches for file systems 
    greater than 1 TB had not been applied.  The decision was made to 
    immediately upgrade the machines to our installation of Solaris 9, which 
    contained the latest version of Veritas, to fix the problem.  What has 
    since been determined is that the Veritas for Solaris 9 was not built 
    with the patches and the patches had to also be applied to the new 
    Veritas for Solaris 9.  The reason the first machine's additional 
    upgrade took so long was that the Veritas support team told Maurice to 
    apply the patches he had for Solaris 8 to the Solaris 9 version and, of 
    course, it did not work.  Maurice finally went to their Web site and 
    found the correct patch file and applied it.  Tests indicated that the 
    file systems on the first machine were repaired.  Maurice was then able 
    to quickly apply the correct patches to the second machine and it came 
    up very quickly.
    
    I am very sorry for the length of time it took to resolve this problem.
    We now believe that the system is stable, and Maurice plans to copy
    anything missing from data07 that is on the partial backup version of
    data07 back to the original file system.
    
    Please let me know if you have any further questions.
    
    Regards,
    Tom
    -- 
    -----------------------------------------------------
    Thomas G. Throwe       E-mail: throwe@bnl.gov
    RCF -- RHIC Project -- Brookhaven National Laboratory
    WWW: http://www.rhic.bnl.gov/RCF/
    Phone: (631) 344-3110  Fax: (631) 344-7616
    -----------------------------------------------------
    
    
    
    II/
    
    Dear Flemming,
    
    As you know, the underlying problem with the disk was the lack of a 
    Veritas patch to allow file systems greater than 1 TB.  All of the 
    machines are now at Solaris 9 and the latest levels of patching, so most 
    of the file systems are in good shape.  The question is what to do with 
    the three problem file systems.  Here is the summary:
    
    data07: this was the file system with the initial problem. It is online, 
    but since the metadata was corrupted, an fsck of the file system does 
    not fix it.  It is believed that anything that can be read from the disk 
    is OK, but nothing should be written.
    
    data08: this file system was used to try to back up data from data07,
    resulting in it getting corrupted when it went over 1 TB.  It is
    currently in the same condition as data07.
    
    data09: similar to data08, except that fsck completely fails with the 
    error that a critical block is missing.  Maurice has been trying to get 
    help on this particular problem from Veritas for over a week with them 
    only asking for more information and not providing anything.
    
    Now for the options:
    
    a) simplest is to blow away all three file systems and rebuild.  This
    would result in the loss of almost all of the data.  (There is another
    copy of part of data07: Maurice does not have much free disk space
    available to play with, but early on he copied data07 into what space
    he had until it filled up.  This file system can be made available and
    is expected to be OK, but it is not a complete copy of data07.)
    
    b) continue to wait for Veritas input on data09 - no other option here 
    except a rebuild since fsck fails.
    
    c) try to shrink the data07 and data08 file systems below the 1 TB mark 
    to see if this will repair the metadata.  Veritas has not given any 
    advice on this option yet. There is a chance that this will work, but 
    there is also a chance that it will destroy the file system (since 
    shrinking is done by defragmenting and this, potentially, could try to 
    move a lot of the data and corrupt it in the process).
    
    Maurice does not want to proceed without some guidance.  If you feel 
    that, since people can write to other file systems and read the 
    non-corrupted files from data07 and data08, it is OK to wait for Veritas 
    input on all three file systems, then that may be the safest option.
    
    If you are willing to take the risk of shrinking the file systems, then 
    Maurice can try that. We just need your input before trying this.
    
    If you have any comments, or need more information, let me know.
    
    Tom
    
    
    ------------------------------------------------------
    Flemming Videbaek
    Physics Department
    Brookhaven National Laboratory
    
    tlf: 631-344-4106
    fax 631-344-1334
    e-mail: videbaek@bnl.gov
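
Both messages above trace the corruption to file systems growing past 1 TB without the required patch. As a rough illustration only, the sketch below flags file systems whose total size exceeds that threshold. It is not something RCF or Veritas used: the function names, the placeholder names `fs_a`, `fs_b`, `fs_c`, their sizes, and the reading of "1 TB" as 2**40 bytes are all assumptions made for the example.

```python
import shutil

# Assumption: "1 TB" is read here as 2**40 bytes; the messages do not
# say which definition the Veritas limit uses.
ONE_TB = 2**40


def over_limit(sizes, limit=ONE_TB):
    """Return the sorted names whose size in bytes is strictly above limit."""
    return sorted(name for name, size in sizes.items() if size > limit)


def mounted_sizes(mount_points):
    """Map each mount point to its total size in bytes, as reported by the OS."""
    return {mp: shutil.disk_usage(mp).total for mp in mount_points}


if __name__ == "__main__":
    # Hypothetical sizes for illustration; these are not the real BRAHMS values.
    example = {"fs_a": ONE_TB + 1, "fs_b": ONE_TB, "fs_c": 5}
    print(over_limit(example))
```

On a live machine, `over_limit(mounted_sizes([...]))` with the real mount points would give the same kind of list; the mount points themselves are not stated in the messages and would have to be supplied.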
    


    This archive was generated by hypermail 2.1.5 : Mon Jul 14 2003 - 12:23:02 EDT