Hello.

I found the cause of our problem, although we cannot fix it. The problem was due to the disk IO limitation of rmine002. rmine002 and rmine003 are older (or slower) NFS servers, and this seems to limit our IO capability before we reach the network limitation.

Therefore, if you have IO-intensive jobs on rmine002 (data01-data06) or rmine003 (data10-data14), I strongly suggest that you submit fewer jobs, since your limitation is IO (and not CPU). (In that case your jobs do not run any faster when you run a lot of them. The only thing you will do is limit other people's ability to submit jobs by taking up Condor queues.)

The easiest way to check this is to use "top" on the machines where your jobs are running. If you see IO wait of more than ~10%, you are already reaching the IO limit.

Hiro

Hironori Ito wrote:
> Hello. Currently (about 5PM), our network in RCF is very slow. (e.g.
> IO wait of my jobs by top is 95%.) I am not sure who or what is
> causing this massive IO problem, but I would like to find out. Can
> you tell me what people are doing at this time (again about 5PM
> Thursday Sep 1st)? It would be helpful if you could tell me where jobs
> (condor or whatever, including interactive jobs) are running and
> which disks you are accessing (e.g. /brahms/data21 or
> /brahms/u/username, etc.).
>
> Hiro
>
> _______________________________________________
> Brahms-dev-l mailing list
> Brahms-dev-l@lists.bnl.gov
> http://lists.bnl.gov/mailman/listinfo/brahms-dev-l

Received on Tue Sep 6 11:54:51 2005
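For anyone who would rather script the "top" check mentioned in the message than watch it by hand, here is a minimal sketch. It assumes a Linux worker node whose /proc/stat exposes the iowait counter as the fifth value on the aggregate "cpu" line. The function names, the 5-second sampling interval, and the IOWAIT_THRESHOLD constant are illustrative choices, not part of any RCF or Condor tooling; the 10% figure is only the rough rule of thumb quoted above.

```python
import time

# Rough rule of thumb from the message: more than ~10% IO wait means IO-bound.
IOWAIT_THRESHOLD = 10.0


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user time, so it is left out.
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in IO wait between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter (assumed layout, see lead-in).
    return 100.0 * deltas[4] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters from the given file."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


if __name__ == "__main__":
    first = read_cpu_counters()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = read_cpu_counters()
    percent = iowait_percent(first, second)
    print("IO wait: %.1f%%" % percent)
    if percent > IOWAIT_THRESHOLD:
        print("Likely IO-bound; submitting more jobs will not help.")
```

Run on the node where the jobs are executing, this reports the machine-wide IO wait over the sampling interval, which is the same quantity "top" shows in its CPU summary line.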