Re: [Brahms-dev-l] who or what is running jobs to strain our network???

From: Hironori Ito <hito@rcf.rhic.bnl.gov>
Date: Tue Sep 06 2005 - 11:51:37 EDT
Hello.  I found the cause of our problem, although we cannot fix it.  
The problem is the disk IO limitation of rmine002.  rmine002 and 
rmine003 are older (or slower) NFS servers.  This seems to limit our IO 
capability before we reach the network limitation.  Therefore, if you 
have IO-intensive jobs using rmine002 (data01-data06) or rmine003 
(data10-data14), I strongly suggest you submit a smaller number of jobs, 
since your limitation is IO (and not CPU).  (In that case your jobs do 
not run any faster by running a lot of them.  The only thing you will do 
is keep other people from running jobs by taking up the Condor queues.)  
The easiest way to check this is to run "top" on the machines where your 
jobs are running.  If you see an IO wait of more than ~10%, you are 
already reaching the IO limit.  
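
If you would rather check this from a script than watch "top", here is 
a minimal sketch in Python.  It assumes a Linux node whose /proc/stat 
"cpu" line includes the iowait counter (the fifth number after "cpu"); 
older kernels may not report it.  The function names and the 10% 
threshold constant are only illustrative (the threshold is the rough 
figure mentioned above), and this is not an RCF-provided tool.

```python
import time

# Rough threshold taken from the rule of thumb above (~10% IO wait).
IO_WAIT_THRESHOLD = 10.0


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    counters = [int(x) for x in fields[1:]]
    if len(counters) < 5:
        raise ValueError("this kernel does not report an iowait counter")
    return counters


def iowait_percent(before, after):
    """Percentage of CPU time spent in IO wait between two 'cpu' lines."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Use user..steal only; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def measure(interval=5.0, path="/proc/stat"):
    """Sample /proc/stat twice, 'interval' seconds apart."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling measure() on a node where your jobs are running and comparing 
the result with IO_WAIT_THRESHOLD gives roughly the same information as 
the IO wait figure in "top", averaged over the whole machine.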

Hiro


Hironori Ito wrote:

> Hello.  Currently (about 5PM), our network in RCF is very slow.  (E.g.  
> the IO wait of my jobs shown by top is 95%.)  I am not sure who or what 
> is causing this massive IO problem, but I would like to find out.  Can 
> you tell me what people are doing at this time (again, about 5PM 
> Thursday Sep 1st)?  It would be helpful if you could tell me where your 
> jobs (Condor or whatever, including interactive jobs) are running and 
> which disks you are accessing (e.g. /brahms/data21, 
> /brahms/u/username, etc.).
>
> Hiro
>
> _______________________________________________
> Brahms-dev-l mailing list
> Brahms-dev-l@lists.bnl.gov
> http://lists.bnl.gov/mailman/listinfo/brahms-dev-l



Received on Tue Sep 6 11:54:51 2005
