Re: [Brahms-dev-l] Proof sessions

From: Flemming Videbaek <videbaek_at_bnl.gov>
Date: Sat, 5 May 2007 14:13:00 -0400
Hi Selemon,

I believe that PROOF sessions die when they reach an error, even if the local jobs keep running;
I have seen this under some circumstances. The running Condor jobs are, in my opinion, a red herring
as far as PROOF is concerned, though there are other issues with that.

Flemming




--------------------------------------------
Flemming Videbaek
Physics Department 
Bldg 510-D
Brookhaven National Laboratory
Upton, NY11973

phone: 631-344-4106
cell:       631-681-1596
fax:        631-344-1334
e-mail: videbaek @ bnl gov
----- Original Message ----- 
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: "Flemming Videbaek" <videbaek_at_bnl.gov>
Cc: "devlist" <brahms-dev-l_at_lists.bnl.gov>
Sent: Saturday, May 05, 2007 2:09 PM
Subject: RE: [Brahms-dev-l] Proof sessions




Hi Flemming,

   I have run the code locally and saw no problem with it. I also ran
with PROOF on all events from all files for the setting 90B350 the
whole afternoon yesterday and saw no problems.
 
   As for the errors below, I will clean up my code, but they do not
seem to matter. Looking at the PROOF progress monitor, the second
session is stuck with only 0.8 seconds left for the session to finish.
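
   For the unknown-branch errors, one way to clean this up is to only
hook up branches that actually exist in the tree. A minimal sketch,
assuming plain TTree::SetBranchAddress calls (the helper name and the
member variables are placeholders, not the actual class code):

  #include "TTree.h"

  // Only connect a branch if the tree really has it; this avoids the
  // "unknown branch" errors printed on every PROOF slave.
  void SetBranchIfPresent(TTree* tree, const char* name, void* address)
  {
     if (tree->GetBranch(name)) {        // GetBranch returns 0 if the branch is missing
        tree->SetBranchAddress(name, address);
     }
  }

  // usage, e.g.: SetBranchIfPresent(fChain, "FL.fSi1bMult", &fSi1bMult);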

  I still think that whenever another CPU-intensive job is running
on the PROOF nodes, the sessions are suspended.

Selemon,

-----Original Message-----
From: Flemming Videbaek [mailto:videbaek_at_bnl.gov]
Sent: Sat 5/5/2007 12:03 PM
To: Bekele, Selemon
Cc: devlist
Subject: Re: [Brahms-dev-l] Proof sessions
 
Hi Selemon,

The problem is with the files and/or the code used.
When I look at /var/log/ROOT.log I see the errors quoted below, i.e. it has nothing to do with condor.
Try to run your session with just a few thousand events and get that to work before submitting.
I also recommend you go to each node and kill the proofserv processes (3 on node 41, two on the subsequent nodes).
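
For reference, a minimal sketch of such a small local test (the tree,
file, and selector names are placeholders, not the actual ones):

  // testLocal.C -- run with: root -l testLocal.C
  #include "TChain.h"

  void testLocal()
  {
     TChain chain("TrackTree");            // assumed tree name
     chain.Add("run90B350_sample.root");   // one local file, placeholder name
     // Process only the first 5000 entries with the selector, no PROOF involved
     chain.Process("MySelector.C+", "", 5000);
  }

Once that runs cleanly, the same selector can be handed to the PROOF session.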

Flemming



> FL.fSi1bMult
May  4 22:09:58 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1cMult
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1dMult
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1eMult
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1fMult
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1gMult
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1aEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1bEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1cEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1dEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1eEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1fEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FL.fSi1gEta
May  4 22:09:59 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FS.fVertexFlag
May  4 22:10:00 rcas0041 proofslave[9165]: tigist:slave 0.0:Error:<TTree::SetBranchAddress>:unknown branch -> FFS.fVertexFlag
May  4 22:10:02 rcas0041 proofslave[9165]: !!!cleanup!!!

--------------------------------------------
Flemming Videbaek
Physics Department 
Bldg 510-D
Brookhaven National Laboratory
Upton, NY11973

phone: 631-344-4106
cell:       631-681-1596
fax:        631-344-1334
e-mail: videbaek @ bnl gov
----- Original Message ----- 
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: <brahms-dev-l_at_lists.bnl.gov>
Sent: Saturday, May 05, 2007 12:52 PM
Subject: [Brahms-dev-l] Proof sessions


> 
> Hi,
> 
>   I have been trying to run PROOF sessions
> (6 centrality bins X 6 field settings = 30 sessions)
> with the master node on rcas0041. I run the sessions
> sequentially for each centrality from a shell script.
> Since 9:00 PM Friday night only the very first session
> has finished; the second session is suspended, which
> means the subsequent runs could not be done.
> 
> Doing
> 
> rcas0041:> condor_status -claimed
> 
> I see:
> 
> vm1_at_rcas0041. LINUX       INTEL  0.820  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0041. LINUX       INTEL  0.870  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0042. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0042. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0043. LINUX       INTEL  0.800  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0043. LINUX       INTEL  0.820  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0044. LINUX       INTEL  0.830  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0044. LINUX       INTEL  0.860  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0045. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0045. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0046. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0046. LINUX       INTEL  0.000  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm1_at_rcas0047. LINUX       INTEL  0.750  claudius_at_bnl.gov     rcas2065.rcf.bn
> vm2_at_rcas0047. LINUX       INTEL  0.710  claudius_at_bnl.gov     rcas2065.rcf.bn
> 
> 
> It seems like the PROOF sessions are suspended because,
> I think, someone is running CPU-intensive jobs on the
> BRAHMS rcas machines. I do not think changing to a different
> master node would help, as all the BRAHMS machines seem to be
> taken.
> 
> Has anyone faced the same problem and found a quick solution, or
> do I just need to wait until the machines become free?
> 
> Selemon,
> 
> _______________________________________________
> Brahms-dev-l mailing list
> Brahms-dev-l_at_lists.bnl.gov
> https://lists.bnl.gov/mailman/listinfo/brahms-dev-l
>


_______________________________________________
Brahms-dev-l mailing list
Brahms-dev-l_at_lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/brahms-dev-l
Received on Sat May 05 2007 - 14:13:37 EDT
