From: Flemming Videbaek (videbaek@sgs1.hirg.bnl.gov)
Date: Mon Jul 14 2003 - 12:22:44 EDT
FYI, I am reading through my many mail messages since returning today from vacation. The following seems related to the problems Steve and others had with data07. There are two mails from Tom Throwe; since the second seems to explain the final outcome, I put it first. Has this solved the problem to everyone's satisfaction?

Flemming

I/

Dear Flemming,

As I said in my previous message to you, the initial problem with the data disks on the BRAHMS machines was that the patches for file systems greater than 1 TB had not been applied. The decision was made to immediately upgrade the machines to our installation of Solaris 9, which contained the latest version of Veritas, to fix the problem. What has since been determined is that the Veritas for Solaris 9 was not built with the patches, so the patches also had to be applied to the new Veritas for Solaris 9.

The reason the first machine's additional upgrade took so long was that the Veritas support team told Maurice to apply the patches he had for Solaris 8 to the Solaris 9 version and, of course, it did not work. Maurice finally went to their Web site, found the correct patch file, and applied it. Tests indicated that the file systems on the first machine were repaired. Maurice was then able to apply the correct patches to the second machine, and it came up very quickly.

I am very sorry for the length of time it took to resolve this problem. We now believe that the system is stable, and Maurice plans to copy anything missing from data07 that is on the partial backup version of data07 back to the original file system.

Please let me know if you have any further questions.

Regards,
Tom

--
-----------------------------------------------------
Thomas G. Throwe
E-mail: throwe@bnl.gov
RCF -- RHIC Project -- Brookhaven National Laboratory
WWW: http://www.rhic.bnl.gov/RCF/
Phone: (631) 344-3110
Fax: (631) 344-7616
-----------------------------------------------------

II/

Dear Flemming,

As you know, the underlying problem with the disk was the lack of a Veritas patch to allow file systems greater than 1 TB. All of the machines are now at Solaris 9 and the latest levels of patching, so most of the file systems are in good shape. The question is what to do with the three problem file systems. Here is the summary:

data07: this was the file system with the initial problem. It is online, but since the metadata was corrupted, an fsck of the file system does not fix it. It is believed that anything that can be read from the disk is OK, but nothing should be written.

data08: this file system was used to try to back up data from data07, resulting in it getting corrupted when it went over 1 TB. It is currently in the same condition as data07.

data09: similar to data08, except that fsck completely fails with the error that a critical block is missing. Maurice has been trying to get help on this particular problem from Veritas for over a week, with them only asking for more information and not providing anything.

Now for the options:

a) simplest is to blow away all three file systems and rebuild. This would result in the loss of almost all of the data. (There is another copy of part of data07. Maurice does not have much free disk space available to play with, but he used some of what he had conveniently and copied data07 early on until it filled that space. This file system can be made available and is expected to be OK, but it is not a complete copy of data07.)

b) continue to wait for Veritas input on data09 - no other option here except a rebuild, since fsck fails.

c) try to shrink the data07 and data08 file systems below the 1 TB mark to see if this will repair the metadata. Veritas has not given any advice on this option yet.
There is a chance that this will work, but there is also a chance that it will destroy the file system (since shrinking is done by defragmenting, and this could potentially try to move a lot of the data and corrupt it in the process). Maurice does not want to proceed without some guidance.

If you feel that, since people can write to other file systems and read the non-corrupted files from data07 and data08, it is OK to wait for Veritas input on all three file systems, then that may be the safest option. If you are willing to take the risk of shrinking the file systems, then Maurice can try that. We just need your input before trying this.

If you have any comments, or need more information, let me know.

Tom

--
-----------------------------------------------------
Thomas G. Throwe
E-mail: throwe@bnl.gov
RCF -- RHIC Project -- Brookhaven National Laboratory
WWW: http://www.rhic.bnl.gov/RCF/
Phone: (631) 344-3110
Fax: (631) 344-7616
-----------------------------------------------------

------------------------------------------------------
Flemming Videbaek
Physics Department
Brookhaven National Laboratory
tlf: 631-344-4106
fax: 631-344-1334
e-mail: videbaek@bnl.gov