From: Flemming Videbaek (videbaek@sgs1.hirg.bnl.gov)
Date: Thu Aug 07 2003 - 11:30:31 EDT
RCF 8/7/2003

Disk
A software problem caused the massive breakdown (Thursday), i.e. the monitoring software. Would like to return it, but the company is not willing so far. One does need monitoring to catch the small hardware problems as they develop.

- NFS sometimes dies under heavy load. Looked at increasing the automount time and the number of simultaneous threads (increased from 256 to 512). This did not make things worse; it will be increased on all machines as they are rebooted.
- Upgrades (patches) are all done.

Phobos complained that no notice was posted about the problems. Management agrees that anyone in RCF could have notified users.

Linux OS upgrade
This month, starting August 25. Frozen version on new machine. Phenix could not get OBJy to work with RH9, so will go with RH8.

Condor
Running on ~500 machines. Jobs only go to experiment-owned machines. Condor suspends jobs but does not free resources, i.e. memory. A single manager maintains all machines. The config on individual machines limits access from different experiments.

LSF
An algorithm error was found in fair share, where old load factors were not erased. New images + libraries installed.

HPSS
Htar defaults changed. HSI needs Kerberos activated.

Solaris Kerberos
Password can only be changed on the gateway. Some people have problems (Linux PAM stack problem). Will be fixed in newer versions.

- Schedule for AFS changeover: need long lead time for user and outside notification, i.e. the RHIC cell name will change.

------------------------------------------------------
Flemming Videbaek
Physics Department
Brookhaven National Laboratory

tlf: 631-344-4106
fax 631-344-1334
e-mail: videbaek@bnl.gov
This archive was generated by hypermail 2.1.5 : Thu Aug 07 2003 - 11:31:19 EDT