Re: [Brahms-dev-l] proof

From: Bekele, Selemon <bekeleku_at_ku.edu>
Date: Mon, 17 Sep 2007 17:46:53 -0500
Hi Flemming,

   I have been checking for bad files at the 40A1050
setting. I have put some plots at

(1) http://www4.rcf.bnl.gov/~tigist/Proof/MomDist_40A1050.gif  
(2) http://www4.rcf.bnl.gov/~tigist/Proof/Runs13181_13259.gif  
(3) http://www4.rcf.bnl.gov/~tigist/Proof/Runs13260_13344.gif  
(4) http://www4.rcf.bnl.gov/~tigist/Proof/Runs13345_13404.gif

(1) is the momentum distribution from the dst files for each run.
(2), (3), (4) are data obtained with proof on Saturday
Sept. 14. Except for runs 13185 and 13196, all runs have data
in them. 

    I checked /var/log/ROOT.log on rcas0055 and I saw 

    *** Break *** segmentation violation

in two places. Just to be sure, I ran proof over runs 13185 and 13196
again and got data; the results are shown in the last 2 histograms in (4).

     SO IT IS NOT CLEAR TO ME WHY THINGS ARE NOT STABLE FOR ME.

   I switched to master node rcas0007 and ran over all files in a single
proof session. I was monitoring the proof display and noticed that about
30 seconds before the end of the session, the session broke with the message

=====================
 SysError in <TFile::Flush>: error flushing file hists/Test40A1050_cucu200_pions_MRS_0_10.root (Stale NFS file handle)
Output file closed
=====================

which might explain why I am getting empty histograms.

In case it gives any hint, here is part of the log file
/var/log/ROOT.log on rcas0007

======================
Sep 17 15:31:30 rcas0007 proofserv[10740]: tigist:master-0:Info:<TProofServ::SetQueryRunning>:starting query: 1
Sep 17 15:31:39 rcas0007 proofserv[10740]: tigist:master-0:Info:<TAdaptivePacketizer::TAdaptivePacketizer>:fraction of remote files 1.000000
Sep 17 16:32:57 rcas0007 proofserv[10740]: !!!cleanup!!!
Sep 17 16:32:58 rcas0007 proofslave[10747]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInputDuringProcess>:unknown command 1000
Sep 17 16:32:59 rcas0007 proofslave[10747]: tigist:worker-0.1:Error:<TProofServ::GetNextPacket>:unexpected answer message type: 1027
Sep 17 16:33:05 rcas0007 proofslave[10747]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInput>:unknown command 1011
Sep 17 17:10:11 rcas0007 proofserv[10740]: tigist:master-0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:11 rcas0007 proofslave[10747]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:13 rcas0007 proofslave[10749]: tigist:worker-0.3:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:13 rcas0007 proofslave[10748]: tigist:worker-0.2:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
=======================

For normal completion, I usually get something like

============================
Sep 17 15:31:30 rcas0007 proofserv[10740]: tigist:master-0:Info:<TProofServ::SetQueryRunning>:starting query: 1
Sep 17 15:31:39 rcas0007 proofserv[10740]: tigist:master-0:Info:<TAdaptivePacketizer::TAdaptivePacketizer>:fraction of remote files 1.000000
Sep 17 16:32:57 rcas0007 proofserv[10740]: !!!cleanup!!!
Sep 17 17:10:11 rcas0007 proofserv[10740]: tigist:master-0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:11 rcas0007 proofslave[10747]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:13 rcas0007 proofslave[10749]: tigist:worker-0.3:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 17 17:10:13 rcas0007 proofslave[10748]: tigist:worker-0.2:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
=============================

On rcas0055, I see various kinds of error messages like

============================
Sep 17 14:52:50 rcas0055 proofslave[14487]: tigist:worker-0.0:Error:<TProofServ::GetNextPacket>:unexpected answer message type: 1000
Sep 17 14:53:15 rcas0055 proofslave[14487]: tigist:worker-0.0:Error:<TProofServ::HandleSocketInput>:unknown command 1011
Sep 17 15:29:59 rcas0055 proofserv[16974]: tigist:master-0:*** Break ***:segmentation violation
Sep 17 16:39:42 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::UnixSend>:send (Connection reset by peer)
Sep 17 16:39:42 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)
Sep 17 16:43:59 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TProofServ::SendLogFile>:error sending log file (No such file or directory)
Sep 17 16:43:59 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)
============================

    I am not sure what really is going on with the proof sessions.
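
To compare the two nodes more systematically, the ROOT.log lines could be tallied per process and message type. The following is only a minimal sketch: the regular expression is inferred from the excerpts quoted in this message, and the labels ("crash", "syserror", ...) and function names are illustrative choices, not anything defined by PROOF.

```python
import re
from collections import Counter

# Line layout inferred from the /var/log/ROOT.log excerpts in this thread, e.g.
#   "Sep 17 15:29:59 rcas0055 proofserv[16974]: tigist:master-0:*** Break ***:segmentation violation"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]: (?P<body>.*)$"
)

def classify(body):
    """Return a coarse, illustrative label for the message part of a log line."""
    if "*** Break ***" in body:
        return "crash"
    if ":SysError:" in body:
        return "syserror"
    if ":Error:" in body:
        return "error"
    if ":Info:" in body:
        return "info"
    return "other"  # e.g. "!!!cleanup!!!"

def tally(lines):
    """Count (process, pid, label) triples; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # e.g. "last message repeated 5 times"
        counts[(m["proc"], int(m["pid"]), classify(m["body"]))] += 1
    return counts

# Lines copied from the rcas0055 excerpt above.
SAMPLE = [
    "Sep 17 14:52:50 rcas0055 proofslave[14487]: tigist:worker-0.0:Error:<TProofServ::GetNextPacket>:unexpected answer message type: 1000",
    "Sep 17 14:53:15 rcas0055 proofslave[14487]: tigist:worker-0.0:Error:<TProofServ::HandleSocketInput>:unknown command 1011",
    "Sep 17 15:29:59 rcas0055 proofserv[16974]: tigist:master-0:*** Break ***:segmentation violation",
    "Sep 17 16:39:42 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::UnixSend>:send (Connection reset by peer)",
    "Sep 17 16:39:42 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)",
    "Sep 17 16:43:59 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TProofServ::SendLogFile>:error sending log file (No such file or directory)",
    "Sep 17 16:43:59 rcas0055 proofserv[22320]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)",
]

if __name__ == "__main__":
    for key, n in sorted(tally(SAMPLE).items()):
        print(key, n)
```

Grouping by PID makes it easier to see whether the crashes and the socket errors belong to the same session or to different ones.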

Selemon,



-----Original Message-----
From: Flemming Videbaek [mailto:videbaek_at_bnl.gov]
Sent: Thu 9/13/2007 6:50 PM
To: Bekele, Selemon; brahms-dev-l_at_lists.bnl.gov
Subject: Re: [Brahms-dev-l] proof
 
I guess this means (as indicated before) that there is a bad file in your sample.
Histograms are empty when the slave processes terminate with run-time errors.

flemming


--------------------------------------------
Flemming Videbaek
Physics Department
Bldg 510-D
Brookhaven National Laboratory
Upton, NY11973

phone: 631-344-410
cell:       631-681-1596
fax:        631-344-1334
e-mail: videbaek @ bnl gov
----- Original Message ----- 
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: "Flemming Videbaek" <videbaek_at_bnl.gov>; <brahms-dev-l_at_lists.bnl.gov>
Sent: Thursday, September 13, 2007 7:22 PM
Subject: RE: [Brahms-dev-l] proof



Hi Flemming,

    I have been trying different numbers of runs to see
what the problem is. I checked 10, 20, 30, 40 and 50 runs and
I get histograms with data. When I use 57 runs, I get
empty histograms. I checked the last seven runs one by
one

/brahms/data14//data/run05/cucu/200/r13364/dst/dst013364v2p3.root
/brahms/data14//data/run05/cucu/200/r13365/dst/dst013365v2p3.root
/brahms/data14//data/run05/cucu/200/r13366/dst/dst013366v2p3.root
/brahms/data14//data/run05/cucu/200/r13378/dst/dst013378v2p3.root
/brahms/data19//data/run05/cucu/200/r13402/dst/dst013402v2p3.root
/brahms/data19//data/run05/cucu/200/r13403/dst/dst013403v2p3.root
/brahms/data19//data/run05/cucu/200/r13404/dst/dst013404v2p3.root

Except for run 13403, all the others have data in them. I combined
these seven runs and ran a proof session and had no problem getting
histograms with data, i.e., the presence of run 13403 did not affect
the results.
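
The prefix checks described above (10, 20, ... runs) could in principle be automated as a bisection over the file list. This is a sketch under a strong assumption: that the failure is reproducible, so that once a prefix of the list fails, every longer prefix fails too. The seven-run test above suggests this may not hold here, in which case the search would not give a meaningful answer. The `passes` callable is hypothetical; it stands for "run a proof session over these files and report whether the histograms are filled".

```python
def find_first_bad(files, passes):
    """Return the file whose inclusion first makes `passes` fail, or None.

    `passes(subset)` is a hypothetical callable returning True when a proof
    session over `subset` yields filled histograms. Assumes a reproducible
    failure: if a prefix fails, every longer prefix fails as well.
    """
    if not files or passes(files):
        return None
    # Invariant: the prefix of length lo passes, the prefix of length hi fails.
    lo, hi = 0, len(files)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(files[:mid]):
            lo = mid
        else:
            hi = mid
    return files[hi - 1]


if __name__ == "__main__":
    # Toy stand-in for real sessions: 57 made-up names, one of them "bad".
    names = [f"file{i:02d}.root" for i in range(57)]
    culprit = find_first_bad(names, lambda subset: "file42.root" not in subset)
    print(culprit)
```

For 57 files this needs about six sessions instead of one per file, but only if the same input always fails in the same way.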

   I do not see a log file in ~/proof. I checked /var/log/ROOT.log on
rcas0055 and I get the following (part of the log file):

===================================
Sep 13 13:22:04 rcas0055 proofslave[25001]: tigist:worker-0.0:*** Break ***:bus error
Sep 13 13:22:13 rcas0055 proofslave[26833]: !!!cleanup!!!
Sep 13 13:22:13 rcas0055 proofslave[26834]: !!!cleanup!!!
Sep 13 13:51:40 rcas0055 proofserv[17773]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)
Sep 13 13:57:24 rcas0055 last message repeated 5 times
Sep 13 13:57:24 rcas0055 last message repeated 4 times
Sep 13 13:57:24 rcas0055 proofserv[26827]: tigist:master-0:SysError:<TUnixSystem::UnixRecv>:recv (Connection reset by peer)
==================================

As the last line shows, the connection to the proofserv seems to have been
reset for some reason, maybe due to the errors above. Maybe that is why I am
getting empty histograms.

If there is no limit on the number of runs, could there be a time
limit on the proof sessions?
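
One way to look into the time-limit question would be to measure, for each master process, how long it ran between "starting query" and its first SysError or Break line, and to see whether the numbers cluster around a fixed value. The sketch below assumes the syslog layout seen in the log in this message, with each entry on a single unwrapped line; the year is not in the log and is passed in by hand, and the function names are illustrative.

```python
import re
from datetime import datetime

# Only master (proofserv) lines are considered; layout inferred from the log in this message.
MASTER_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ proofserv\[(\d+)\]: (.*)$")

def parse_stamp(text, year):
    """Parse a syslog time stamp such as 'Sep 13 13:21:24'; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")

def minutes_to_failure(lines, year=2007):
    """Map each master PID to the minutes between its query start and its first failure line."""
    started, result = {}, {}
    for line in lines:
        m = MASTER_RE.match(line.strip())
        if m is None:
            continue
        stamp, pid, body = parse_stamp(m[1], year), int(m[2]), m[3]
        if "starting query" in body:
            started[pid] = stamp
        elif pid in started and pid not in result and (
            ":SysError:" in body or "*** Break ***" in body
        ):
            result[pid] = (stamp - started[pid]).total_seconds() / 60
    return result

# Lines taken from the rcas0055 log in this message, joined where the mail wrapped them.
SAMPLE = [
    "Sep 13 13:21:24 rcas0055 proofserv[26827]: tigist:master-0:Info:<TProofServ::SetQueryRunning>:starting query: 1",
    "Sep 13 13:51:40 rcas0055 proofserv[17773]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)",
    "Sep 13 13:57:24 rcas0055 proofserv[26827]: tigist:master-0:SysError:<TUnixSystem::UnixRecv>:recv (Connection reset by peer)",
]

if __name__ == "__main__":
    print(minutes_to_failure(SAMPLE))
```

The line from proofserv[17773] is ignored because no query start for that PID appears in the sample. Similar durations across several failed sessions would hint at a timeout; widely different ones would point elsewhere.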

Selemon,

The whole log file is given below:

Sep 13 12:59:27 rcas0055 proofserv[24995]: tigist:master-0:Info:<TProofServ::SetQueryRunning>:starting query: 1
Sep 13 12:59:28 rcas0055 proofserv[24995]: tigist:master-0:Info:<TAdaptivePacketizer::TAdaptivePacketizer>:fraction of remote files 1.000000
Sep 13 12:59:29 rcas0055 proofslave[25002]: tigist:worker-0.1:Info:<TUnixSystem::ACLiC>:creating shared library /direct/brahms+u/tigist/proof/slave-0.1-rcas0055-1189702720-25002/./SpectrumReplay_C.so
Sep 13 12:59:29 rcas0055 proofslave[25001]: tigist:worker-0.0:Info:<TUnixSystem::ACLiC>:creating shared library /direct/brahms+u/tigist/proof/slave-0.0-rcas0055-1189702720-25001/./SpectrumReplay_C.so
Sep 13 12:59:57 rcas0055 proofslave[25002]: !!!cleanup!!!
Sep 13 12:59:57 rcas0055 proofslave[25001]: !!!cleanup!!!
Sep 13 13:10:00 rcas0055 proofserv[24995]: !!!cleanup!!!
Sep 13 13:16:39 rcas0055 proofserv[24995]: tigist:master-0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 13 13:16:39 rcas0055 proofslave[25002]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 13 13:16:39 rcas0055 proofslave[25001]: tigist:worker-0.0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 13 13:21:24 rcas0055 proofserv[26827]: tigist:master-0:Info:<TProofServ::SetQueryRunning>:starting query: 1
Sep 13 13:21:35 rcas0055 proofserv[26827]: tigist:master-0:Info:<TAdaptivePacketizer::TAdaptivePacketizer>:fraction of remote files 1.000000
Sep 13 13:21:35 rcas0055 proofslave[26834]: tigist:worker-0.1:Info:<TUnixSystem::ACLiC>:creating shared library /direct/brahms+u/tigist/proof/slave-0.1-rcas0055-1189704040-26834/./SpectrumReplay_C.so
Sep 13 13:21:35 rcas0055 proofslave[26833]: tigist:worker-0.0:Info:<TUnixSystem::ACLiC>:creating shared library /direct/brahms+u/tigist/proof/slave-0.0-rcas0055-1189704040-26833/./SpectrumReplay_C.so
Sep 13 13:22:04 rcas0055 proofslave[25001]: tigist:worker-0.0:*** Break ***:bus error
Sep 13 13:22:13 rcas0055 proofslave[26833]: !!!cleanup!!!
Sep 13 13:22:13 rcas0055 proofslave[26834]: !!!cleanup!!!
Sep 13 13:51:40 rcas0055 proofserv[17773]: tigist:master-0:SysError:<TUnixSystem::DispatchOneEvent>:select: read error on 4  (Bad file descriptor)
Sep 13 13:57:24 rcas0055 last message repeated 5 times
Sep 13 13:57:24 rcas0055 last message repeated 4 times
Sep 13 13:57:24 rcas0055 proofserv[26827]: tigist:master-0:SysError:<TUnixSystem::UnixRecv>:recv (Connection reset by peer)
Sep 13 13:57:24 rcas0055 proofserv[26827]: !!!cleanup!!!
Sep 13 14:19:33 rcas0055 proofserv[26827]: tigist:master-0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 13 14:19:33 rcas0055 proofslave[26833]: tigist:worker-0.0:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket
Sep 13 14:19:33 rcas0055 proofslave[26834]: tigist:worker-0.1:Error:<TProofServ::HandleSocketInput>:retrieving message from input socket



-----Original Message-----
From: Flemming Videbaek [mailto:videbaek_at_bnl.gov]
Sent: Tue 9/11/2007 5:57 PM
To: Bekele, Selemon; brahms-dev-l_at_lists.bnl.gov
Subject: Re: [Brahms-dev-l] proof

Hi Selemon,

You get empty histograms if there are run-time errors during execution of files. With new proof
it stops, and you can look in the log-files. There should be no limit on the #files (I have run with many)

Flemming




--------------------------------------------
Flemming Videbaek
Physics Department
Bldg 510-D
Brookhaven National Laboratory
Upton, NY11973

phone: 631-344-4106
cell:       631-681-1596
fax:        631-344-1334
e-mail: videbaek @ bnl gov
----- Original Message ----- 
From: "Bekele, Selemon" <bekeleku_at_ku.edu>
To: <brahms-dev-l_at_lists.bnl.gov>
Sent: Tuesday, September 11, 2007 6:49 PM
Subject: [Brahms-dev-l] proof


>
> Hi,
>
>    I am having some problems running proof.
> Over the past week, I have been trying to get
> spectra for the 40A1050 MRS setting. I get 57
> dst files at this setting. After running a proof
> session over all 57 files, I get an output file
> with empty histograms. I checked running with a
> small set of data files, e.g. 10, 20, 30 dst files,
> and I get histograms filled with data.
>
>    Does anyone know if there is a limit on the
> number of files one can process in proof?
>
> thanks,
>
> Selemon
> _______________________________________________
> Brahms-dev-l mailing list
> Brahms-dev-l_at_lists.bnl.gov
> https://lists.bnl.gov/mailman/listinfo/brahms-dev-l
>



Received on Mon Sep 17 2007 - 18:48:13 EDT

This archive was generated by hypermail 2.2.0 : Mon Sep 17 2007 - 18:48:38 EDT