Re: Proposing BRAT/bratmain addition (PLEASE READ!)

From: Bjorn H Samset (bjornhs@rcf2.rhic.bnl.gov)
Date: Thu Dec 12 2002 - 12:38:45 EST

    On Wed, 11 Dec 2002, Christian Holm Christensen wrote:
    
    Hi Christian. As a student politician I've learned that any proposal to
    discontinue a matter should be treated first, so I'll start with your
    comments ;-) (This is a long answer to a long email, so let me just start
    out by encouraging more people to comment. It's important to discuss
    this.)
    
    First of all, let me say that I wholeheartedly agree with your wish for a
    good, clean and working software environment. However, we disagree
    somewhat on what is acceptable to actually get the work done, and also on
    what the general BRAHMS way of working is. One of your main arguments is
    that everything I propose is contained in the scripts, so we could simply
    store those in CVS. This, however, is not how we actually use the scripts.
    To your average BRAHMS data analyst, the framework looks like this:
    * BRAT modules and personal code = what does the work
    * bratmain = the framework to do the work in
    * script = what you use, work on, modify and play with to get the
    reconstruction and bigger data runs out
    * after that, it's personal software, located in brahms_app (together
    with a main version of the scripts)
    My point is that we don't have one script per reconstruction! We may (and
    should!) have one for every major official reco pass, but brat/bratmain is
    used for so much more. I'll come back to this later on.
    
    Let me, as you ask, specify a bit more what I want out of my suggested
    framework and why, using your questions:
    
    * When do you want the information?
    
      First and foremost, I want it when I'm snooping through my own or
    others' data directories and see a file called data_run007324.root. I
    want there, next to this file, to be another file called e.g.
    snp_data_run007324.log, where I can find out
      - what were the input files
      - what was done to them
      - what exactly does this file contain
    
    As it is now, this information is not available anywhere, and it is NOT
    supplied by e.g. putting a copy of the bratmain script in the same
    directory, as the script takes the input files from the command
    line/shellscript/whatever. (Kris: This answers your comment too: You're
    absolutely right. I should not remove the IO, and had I not been on my
    way to the airport I might have thought about it before writing this
    comment ;-)
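    
    To make this concrete, here is a mock-up of what such a snapshot file
    could look like (all names and values below are purely illustrative -
    the exact format is open):
    
        # snp_data_run007324.log
        Input files  : raw_run007324.root
        BRAT version : x.y.z          ROOT version : x.yy/zz
        Pipeline     : BrTileRdoModule "MultTile"  (CVS Id of the source)
                         AdcGapLimit       = 4096
                         SaturationChannel = 30000
                         ...
        Output       : data_run007324.root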
    
    Of course in the ideal world we have this case:
      - there is only one set of data files for a given type of analysis
      - they contain all the info they should, and it is correct
      - there is a database that contains all info on how the files were made
      - there is an easy-to-use program to access the db and get this info
        ( with interface brGetRecoConditions filename.root or something )
      - the db is accessible from every machine that I might conceivably copy
        the datafile onto
    
    However, this is not the case. In addition, we are a large and
    distributed collaboration that shares a central set of disks and CPUs.
    Lots of double work is currently done because one has no idea what went
    into a set of reco files, and it is easier to make one's own than to ask
    the author for details (who may have to dig to find them in the first
    place). You'll notice that I'm not talking about the big, all-inclusive
    official data reductions here - for those we should, in my opinion, have
    all info (including the script) available in a db with a web interface
    so that it is available everywhere (like the calib. page). I'm talking
    about the work and recos that are done prior to this and/or based on
    this, which is actually the significant part of our work.
    
    Also, and this is important, I want a framework that lets me treat data
    and analysis in the same way whether I'm at BNL, in Oslo, at home or on
    an airplane over the Atlantic. I do NOT want to be tied to any particular
    database or directory structure, and I want to be able to use my software
    in the same way whether or not such a database matters for my specific
    task. This is the only way we are going to get people to actually
    document their work on-the-fly (which after all is what this is all
    about), and it's the beauty of e.g. bratmain. I can use the bratmain
    framework anywhere and for any task that can be put into a BrModule. With
    my snapshots you'll get a concise log of what you're doing wherever you
    are and whatever your task in bratmain, and I believe that, just as with
    bratmain, people will (in a while) actually use this and we'll be saved
    some of this and last year's confusion as to the contents of datafiles.
    
    * What kind of information do you want?
    
    Easy: I want
      - input files
      - what modules were in the pipeline and their CVS ID
      - their exact parameters
      - for user-generated modules, a path to where I may find the code
      - snacks like ROOT and BRAT version numbers (yes, the module CVS ID is
        redundant if we have the BRAT version, but I'd like to get the info
        _without_ doing a cvs checkout or similar)
    
    Also, I require this to be in a human-readable format for easy checking.
    Just emitting parseable code like you suggest will work (I'm assuming
    that everyone who uses brat is C++ literate...), so long as we include
    ALL relevant settable parameters of both the module and its parents.
    What 'relevant' means must be decided by the module author.
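    
    To illustrate, here is a minimal sketch of how that could look in code
    (SaveSnapshot is a hypothetical method name, and the data members mirror
    the setters in the script excerpt quoted further down):
    
        #include <ostream>
    
        // Sketch only: each module writes its own settable parameters
        // after asking its parent to do the same, so the snapshot covers
        // the whole inheritance chain.
        void BrTileRdoModule::SaveSnapshot(std::ostream& f) const
        {
          BrMultRdoModule::SaveSnapshot(f);  // parent's parameters first
          f << "  SingleAdcLow:  " << fSingleAdcLow  << std::endl;
          f << "  SingleAdcHigh: " << fSingleAdcHigh << std::endl;
          f << "  SingleAdcComp: " << fSingleAdcComp << std::endl;
        }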
    
    Thirdly, I'd like to be able to either just read in the snapshot file
    and rerun the analysis (think lost data disks...) or, which I think is
    more useful and relevant, regenerate a bratmain script that I can then
    modify as needed and THEN rerun the analysis. All in the interest of
    making life easy for the analysis people (i.e. myself and others like
    me...)
    
    * Where do you want to store the information?
    
    Primary place: With the data files. Period.
    
    Secondly: In the dir. where I run the analysis, since this is on my home
    disk and presumably well backed up. (Remember, b.t.w., that not all data
    is on an RCF disk. It should of course move there in the end, but for
    analysis in progress it is very nice to utilize the 60 GB disk on my
    office computer)
    
    Thirdly: Once I have a set of files that I think may be useful for a
    while, I personally would put them in CVS.
    
    >     To be quite honest, I think these plain text ASCII files are a
    >     pain in the behind that no-one in their right mind would like to
    >     manage.
    
    I disagree. Or maybe I'm just not in my right mind, but you are
    unfortunately stuck with me ;-)
    
    >     Also, you need to make special parsers to (re)make the
    >     configuration scripts and command line options.
    
    Aye. That's easy.
    
    >     It is just so much simpler to make sure your configuration script
    >     is around, and make sure that you record the history of that
    >     script - i.e., put it in a CVS repository.  This has so many
    >     advantages over the proposed job summaries, not to mention that we
    >     don't need to write any code, that I don't see the point of the
    >     proposal.
    
    To summarize: config scripts don't contain all that we need, and I don't
    want to grope around in people's private parts (i.e. whatever CVS dir)
    to find out what's in a file.
    
    Now for some of your specific comments:
    
    > Argh! What's so difficult in writing
    >
    >   f << setw(8) << fIntegerMember << std::endl;
    >
    > compared to
    >
    >   f << PrintIntMember(fIntegerMember) << std::endl;
    >
    > The additional cluttering of the base class isn't worth the gain.
    
    Agreed. The whole point here was pretty-printing for easy parsing or
    script regeneration, and this can be done without the added classes. I
    also had an idea originally that one might want to overload these
    functions, but I've reconsidered.
    
    > This format isn't particularly parser-friendly.
    
    Yup. The precise format isn't well thought through - I wanted some
    comments on the general framework and what should be included first.
    
    > Instead, we should
    > mimic the `TObject::SavePrimitive' member function - that is save
    > proper C++ statements:
    >
    >   //__________________________________________________________________
    >   // Module: BrTileRdoModule
    >   BrTileRdoModule* tileRdoModule =
    >     new BrTileRdoModule("MultTile", "TMA reducer");
    >   tileRdoModule->SetAdcGapLimit(4096);
    >   tileRdoModule->SetSaturationChannel(30000);
    >   tileRdoModule->SetThresholdFactor(5);
    >   tileRdoModule->SetOutlierMethod(BrMultRdoModule::kNoCorrection);
    >   tileRdoModule->SetSingleAdcLow(0);
    >   tileRdoModule->SetSingleAdcHigh(500);
    >   tileRdoModule->SetSingleAdcComp(10);
    >   mainModule->AddModule(tileRdoModule);
    >   ...
    >
    > In this way, the log file is directly `executable' via `bratmain'.
    
    Sounds good, at the cost of some readability.
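    
    For the record, a rough sketch of how a module might write itself out
    in this style (mimicking TObject::SavePrimitive; the method signature
    and member names here are illustrative, not existing BRAT code):
    
        #include <ostream>
    
        void BrTileRdoModule::SavePrimitive(std::ostream& out, Option_t*)
        {
          // Emit executable C++ that recreates this module instance
          out << "  BrTileRdoModule* tileRdoModule =\n"
              << "    new BrTileRdoModule(\"" << GetName() << "\", \""
              << GetTitle() << "\");\n";
          out << "  tileRdoModule->SetAdcGapLimit(" << fAdcGapLimit
              << ");\n";
          // ... one such line per settable parameter ...
          out << "  mainModule->AddModule(tileRdoModule);\n";
        }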
    
    > > Also note that two of the parameters occur twice - that's because
    > > BrTileRdoModule has two setters that are identical to its parent class
    > > BrMultRdoModule. This framework also encourages cleaner coding ;-)
    >
    > The particular code you refer to _is_ clean.  If you check the usage
    > of the particular data members, you'll see the point of the overloaded
    > functions: provide specific defaults for TMA/SMA.  The problem is
    > rather in the code you sent - the Br{Tile,Si}RdoModule shouldn't emit
    > the setting of `fSaturationChannel' etc., as it is BrMultRdoModule
    > that owns those parameters, not Br{Tile,Si}RdoModule.   I guess you're
    > confusing data members and member functions, and which one defines
    > `ownership'.
    >
    > In the future, please check your code before pinning `unclean coding'
    > on others' code, thank you.
    
    Touchy, eh? ;-) I didn't actually mean that this code isn't clean, and
    yes, I do see why the extra setters are there. What I do mean is that
    this framework makes the parameter-setting methods, and the sheer number
    of setters in a module, very visible. Many strange setters and
    parameters (and yes, we do have some instances of that...) = a strange
    logfile. So don't take offense, please.
    
    <snip a lot that I answered in the general part above>
    
    >   You really only need to know what configuration script was used,
    >   its version number (remember, BrMainModule has fields for setting
    >   various version information, as well as author information, and
    >   of course a short description) or CVS revision number, the command
    >   line options passed to `bratmain', the ROOT and BRAT version.
    >
    >   Given these pieces of information, you can easily redo a pass:
    
    Aye, but again I don't get quick-and-easy information on e.g. "was this
    file made before or after I made my Important Change[TM] to
    BrQgpDetector?" Not unless you memorize the BRAT CVS history... I _can_
    get it, but it takes a cvs checkout.
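    
    (For the module CVS ID, standard CVS keyword expansion would do the
    trick without any checkout - something like this in each module's
    source file, with the variable name being illustrative:
    
        // Expanded by CVS to file name, revision and date on checkout,
        // so the snapshot can print it at run time without cvs access.
        static const char* kBrTileRdoModuleCvsId = "$Id$";
    
    The snapshot code then just prints this string.)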
    
    >   This of course requires a certain amount of discipline when changing
    >   the official scripts in BRAT (and they _must_ be in BRAT), namely:
    
    Aye, for all official scripts. But we do a lot of work on the way to the
    official scripts, and that work is just as important to document. (I
    think I'm starting to repeat myself, so I'll stop soon...)
    
    >   For calibration constants, and the like, we should record in the
    >   output file:
    >
    >     - the name of the detector,
    >     - the name of the parameter,
    >     - the revision number of the parameter,
    >     - ... other stuff.
    
    This is a good point, even if you don't like my ideas. Calibrations need
    to be included somehow in the snapshot.
    
    >   Again, this should really be part of the output file rather than a
    >   log or job summary, as it is an integral part of the data.
    
    Maybe that too, but I _still_ want to be able to see it without opening
    the data file itself! ASCII files actually do have a certain charm.
    
    >     Perhaps we should consider making BrModule and it's descendants
    >     persistent, so that one can simply write the module pipeline
    >     directly to the file at the end of the job.  Then all the
    >     information worth knowing on the modules is right there.  Of
    >     course we then need to go through every module and change the
    >     class version to something greater than 1, and make sure that
    >     caches (arrays used for intermediate storage, file pointers,
    >     histograms, <shudder>ntuples</shudder>, <shudder>trees</shudder>,
    >     etc. and ladida) are not written to disk.  Again, this is simple,
    >     efficient, and we don't need to write too much code (almost
    >     nothing in fact).
    
    This is definitely a possibility, but it still doesn't cover the need for
    quick-and-easy access to the data. Ever opened a root file over the line
    from Europe?
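    
    For reference, the ROOT mechanics here are standard even though
    BrExampleModule below is made up: bump the ClassDef version above 1,
    and mark the caches with ROOT's "//!" transient comment so they are
    not streamed to disk:
    
        #include "TH1F.h"
    
        class BrExampleModule : public BrModule {
        private:
          Int_t    fThreshold; // settable parameter - written to file
          TH1F*    fDebugHist; //! cache - transient, NOT written to disk
          Float_t* fWorkArray; //! intermediate storage - transient
        public:
          ClassDef(BrExampleModule, 2) // version > 1: members streamed
        };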
    
    <snip nice UML diagram, which is absolutely correct>
    
    >   You probably think that `Ah, here he goes again, talking about ``The
    >   Right Way'', and that we need a DB, and ladida, but he forgets that
    >   someone needs to write the code, setup the DB, and test it, and
    >   ... - and in all, not so easy after all - lets do a Quick'n'Dirty
    >   hack [ed: Also known as ``Quirky'n'Defunct''] and be done with it'.
    
    As I only said this once: I don't want to require a database. Any
    database. I agree that having all this in a db is both nice and on some
    points essential, but this is actually quite a separate issue from what
    I'm proposing. Database storage and BrPassInfoManager usage should be
    implemented (at once!) for the major data passes, dst production etc.,
    but would be a pain for me when e.g. working on lambdas - at least at
    this stage of the analysis. A db would make my snapshots redundant, but
    in some circles redundancy is actually considered a good thing.
    
    >   Well, there's good news for you:  The code is already written, the
    >   database is set up, and it needs only a little testing!
    >
    >   The code is in `BrPassInfoManager' by Kris, and the database I
    >   believe is called `BrahmsPass' on rcas0005.  Ok, so there are a few
    >   things that aren't done altogether yet, but it isn't much, and it
    >   _is_ The Right Way(tm) :-)
    
    And a nice way, too. Could you or Kris write some documentation so that we
    can all know about it?
    
    Well, I guess this whole email should have had a <rant> on it. To
    conclude, Christian, I really believe that this framework will make
    people's lives easier. If we actually use it, that is, which I think is
    a distinct possibility since all the people I've discussed it with have
    been very positive. I agree that for certain major parts of our analysis
    we should in addition have all this in a database. But I also know that
    the only way I will be able to tell, next year, what the specific
    data-thingy that David made on his local NBI computer and migrated to
    RCF before he handed in his thesis and left for industry actually
    contains, is if there is an easy-to-use documentation scheme out there
    and the documentation it produces is easy to read.
    
    Therefore, I still want to implement these changes ;-)
    
    --
    Bjorn H. Samset                           Phone: 22856465/92051998
    PhD student, heavy ion physics            Adr:   Schouterrassen 6
    Inst. of Physics, University of Oslo             0573 Oslo
                                  \|/
    ----------------------------> -*- <-----------------------------
                                  /|\
    

