Hi Flemming,

I have only one TDSet * set; set->Process("selector","",...) per process. Below is the function called to start the PROOF session:

================
void StartProof(Int_t MainNode)
{
   TString cluster  = Form("rcas00%02d", MainNode);
   TString confFile = Form("proof_rcas00%02d.conf", MainNode);
   //gROOT->Proof(cluster.Data(), confFile.Data());
   //fProof = new TProof(cluster.Data(), confFile.Data()); // does not work
   fProof = TProof::Open(cluster.Data(), confFile.Data()); // added to work with new OS

   //fProof->SetParameter("PROOF_MaxSlavesPerNode", 9999); // does not compile with int
   Long_t maxSlavePerNode = 9999;
   fProof->SetParameter("PROOF_MaxSlavesPerNode", maxSlavePerNode);

   //fProof->Open(cluster.Data(), confFile.Data());
   gProof->UploadPackage("BratLibrary.par");
   gProof->EnablePackage("BratLibrary");
}
================

Selemon

-----Original Message-----
From: Flemming Videbaek [mailto:videbaek_at_bnl.gov]
Sent: Thu 9/20/2007 11:58 AM
To: Bekele, Selemon
Cc: JH Lee; brahms-dev-l_at_lists.bnl.gov
Subject: proof

Hi,
I would really like to know how you access the nodes and run, i.e. do you have many TDSet * set; set->Process("selector","",...) calls in a sequence? I have seen that such a sequence can increase the number of running processes. I also see that in the session running right now, only the processes on 62 get any CPU time. Where/when do you do the ->SetParameter("PROOF_MaxSlavesPerNode", ...)? It does look peculiar. I do know that not all memory is released at the end of a Process() on the slaves.
Flemming

--------------------------------------------
Flemming Videbaek
Physics Department Bldg 510-D
Brookhaven National Laboratory
Upton, NY 11973
phone: 631-344-4106
cell: 631-681-1596
fax: 631-344-1334
e-mail: videbaek @ bnl gov

----- Original Message -----
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: "Flemming Videbaek" <videbaek_at_bnl.gov>
Cc: "JH Lee" <jhlee_at_bnl.gov>; <brahms-dev-l_at_lists.bnl.gov>
Sent: Thursday, September 20, 2007 12:51 PM
Subject: RE: [Brahms-dev-l] analysis meeting

Hi Flemming,

I have been monitoring my PROOF sessions for memory size. About an hour and 20 minutes into the sessions, the memory size of the slaves on 63 - 68 is about 235 MB, while the memory size of the slaves on 62 is about 2 GB. Is there any reason why the memory size of the 62 slaves should grow to about 10 times that on the other slave machines? Uneven sharing of the load between the slaves?

Selemon

============================
rcas0062:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist 26091  1.3  2.7  127924  56976 ? Ss 11:19  0:56 /opt/brahms/pro/bin/proofserv.exe proofserv
tigist 26577 68.7 40.9 2108652 849448 ? Rs 11:20 47:53 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist 26578 68.2 40.6 2053872 843488 ? Ds 11:20 47:30 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0063:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist 14298  0.0  9.8  235092 203924 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist 14300  0.0  9.8  234548 203928 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0064:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist 15736  0.0  9.8  233564 203924 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist 15737  0.0  9.8  235480 203924 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0065:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist 16870  0.0  9.8  234236 203928 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist 16871  0.0  9.8  235148 203928 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0066:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist 16629  0.0  9.8  234880 203920 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist 16634  0.0  9.8  235464 203924 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0067:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist  9125  0.0  9.8  233836 203932 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist  9130  0.0  9.8  234328 203920 ? Ss 11:20  0:03 /opt/brahms/pro/bin/proofserv.exe proofslave

rcas0068:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tigist  2143  0.0  9.8  235068 203928 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
tigist  2145  0.0  9.8  234620 203928 ? Ss 11:20  0:04 /opt/brahms/pro/bin/proofserv.exe proofslave
============================

-----Original Message-----
From: Flemming Videbaek [mailto:videbaek_at_bnl.gov]
Sent: Tue 9/18/2007 3:07 PM
To: Bekele, Selemon
Cc: JH Lee
Subject: Re: [Brahms-dev-l] analysis meeting

Hi Selemon,

I see you are running proofserv.exe slaves on rcas0062 - or maybe you are not - in any case their memory size is 2.3 GB per process. Are you sure the processes do not have memory leaks?

Flemming

--------------------------------------------
Flemming Videbaek
Physics Department Bldg 510-D
Brookhaven National Laboratory
Upton, NY 11973
phone: 631-344-4106
cell: 631-681-1596
fax: 631-344-1334
e-mail: videbaek @ bnl gov

----- Original Message -----
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: "Flemming Videbaek" <videbaek_at_bnl.gov>; "devlist" <brahms-dev-l_at_lists.bnl.gov>
Sent: Tuesday, September 18, 2007 1:40 PM
Subject: RE: [Brahms-dev-l] analysis meeting

Hi Flemming,

To make sure that I am not doing something wrong, I ran again over the 57 files individually. This time it is a different run, 13345, that had empty histograms.
I ran over the same file locally and with PROOF and found no problem with the file. The error message from /var/log/ROOT.log on rcas0055:

=======================
Sep 17 22:40:13 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4 (Bad file descriptor)
Sep 17 22:40:13 rcas0055 last message repeated 3 times
Sep 17 22:40:18 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4 (Bad file descriptor)
=======================

Part of the PROOF log file (~tigist/ProofOutPut.dat) is shown below. It seems that a connection to some machine is reset at the very beginning of the session. Could this be a reset of a connection to a database in brahms?

=======================
(Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1)
(Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1)
(Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1) (Int_t)(1)
Info in <TProofServ::SetQueryRunning> on master-0: starting query: 1
Info in <TAdaptivePacketizer::TAdaptivePacketizer> on master-0: fraction of remote files 1.000000
SysError in <TUnixSystem::UnixSend> on master-0: send (Connection reset by peer)
SysError in <TUnixSystem::DispatchOneEvent> on master-0: select: read error on 4 (Bad file descriptor)
=======================

The problem I am facing seems to be quite random. As for the meeting this coming Friday: I need to resolve this issue first, since I do not have anything new after the RCF upgrades.

Selemon

-----Original Message-----
From: brahms-dev-l-bounces_at_lists.bnl.gov on behalf of Flemming Videbaek
Sent: Tue 9/18/2007 9:34 AM
To: devlist
Subject: [Brahms-dev-l] analysis meeting

We will have an analysis meeting this coming Friday. There are planned presentations from Selemon on CuCu and from Ionut on AuAu at 62. The agenda page on Indico has been set up for this meeting.
Flemming

--------------------------------------------
Flemming Videbaek
Physics Department Bldg 510-D
Brookhaven National Laboratory
Upton, NY 11973
phone: 631-344-4106
cell: 631-681-1596
fax: 631-344-1334
e-mail: videbaek @ bnl gov

_______________________________________________
Brahms-dev-l mailing list
Brahms-dev-l_at_lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/brahms-dev-l

Received on Thu Sep 20 2007 - 13:36:44 EDT
This archive was generated by hypermail 2.2.0 : Thu Sep 20 2007 - 13:37:15 EDT