Re: Proposing BRAT/bratmain addition (PLEASE READ!)

From: Bjorn H Samset (bjornhs@rcf2.rhic.bnl.gov)
Date: Thu Dec 12 2002 - 12:38:45 EST

    On Wed, 11 Dec 2002, Christian Holm Christensen wrote:
    
    Hi Christian. As a student politician I've learned that any proposal to
    discontinue a matter should be treated first, so I'll start with your
    comments ;-) (This is a long answer to a long email, so let me just start
    out by encouraging more people to comment. It's important to discuss
    this.)
    
    First of all, let me say that I wholeheartedly agree with your wish for a
    good, clean and working software environment. However, we disagree
    somewhat on what is acceptable to actually get the work done, and also on
    what the general BRAHMS way of working is. One of your main arguments is
    that everything I propose is contained in the scripts, so we could simply
    store those in CVS. This, however, is not how we actually use the scripts.
    To your average BRAHMS data analyst, the framework looks like this:
    * BRAT modules and personal code = what does the work
    * bratmain = the framework to do the work in
    * script = what you use, work on, modify and play with to get the
    reconstruction and bigger data runs out
    * after that, it's personal software, located in brahms_app (together
    with a main version of the scripts)
    My point is that we don't have one script per reconstruction! We may (and
    should!) have one for every major official reco pass, but brat/bratmain is
    used for so much more. I'll come back to this later on.
    
    Let me, as you ask, specify a bit more what I want out of my suggested
    framework and why, using your questions:
    
    * When do you want the information?
    
      First and foremost, I want it when I'm snooping through my own or
    others' data directories and see a file called data_run007324.root. I
    want there, next to this file, to be another file called e.g.
    snp_data_run007324.log, where I can find out
      - what were the input files
      - what was done to them
      - what exactly does this file contain
    
    As it is now, this information is not available anywhere, and it is NOT
    supplied by e.g. putting a copy of the bratmain script in the same
    directory, as the script takes the input files from the command
    line/shellscript/whatever. (Kris: This answers your comment too: You're
    absolutely right. I should not remove the IO, and had I not been on my
    way to the airport I might have thought about it before writing this
    comment ;-)
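    
    To make this concrete, here is a mock-up of what such a snapshot file
    could look like (all names and values below are purely illustrative -
    the exact format is open):
    
        # snp_data_run007324.log
        Input files  : raw_run007324.root
        BRAT version : x.y.z          ROOT version : x.yy/zz
        Pipeline     : BrTileRdoModule "MultTile"  (CVS Id of the source)
                         AdcGapLimit       = 4096
                         SaturationChannel = 30000
                         ...
        Output       : data_run007324.root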
    
    Of course in the ideal world we have this case:
      - there is only one set of data files for a given type of analysis
      - they contain all the info they should, and it is correct
      - there is a database that contains all info on how the files were made
      - there is an easy-to-use program to access the db and get this info
        ( with interface brGetRecoConditions filename.root or something )
      - the db is accessible from every machine that I might conceivably copy
        the datafile onto
    
    However, this is not the case. In addition, we are a large and
    distributed collaboration that shares a central set of disks and CPUs.
    Lots of double work is currently done because one has no idea what went
    into a set of reco files, and it is easier to make one's own than to ask
    the author for details (who may have to dig to find them in the first
    place). You'll notice that I'm not talking about the big, all-inclusive
    official data reductions here - for those we should, in my opinion, have
    all info (including the script) available in a db with a web interface
    so that it is available everywhere (like the calib. page). I'm talking
    about the work and recos that are done prior to this and/or based on
    this, which is actually the significant part of our work.
    
    Also, and this is important, I want a framework that lets me treat data
    and analysis in the same way whether I'm at BNL, in Oslo, at home or on
    an airplane over the Atlantic. I do NOT want to be tied to any particular
    database or directory structure, and I want to be able to use my software
    in the same way whether or not such a database matters for my specific
    task. This is the only way we are going to get people to actually
    document their work on-the-fly (which after all is what this is all
    about), and it's the beauty of e.g. bratmain. I can use the bratmain
    framework anywhere and for any task that can be put into a BrModule. With
    my snapshots you'll get a concise log of what you're doing wherever you
    are and whatever your task in bratmain, and I believe that, just as with
    bratmain, people will (in a while) actually use this and we'll be saved
    some of this and last year's confusion as to the contents of datafiles.
    
    * What kind of information do you want?
    
    Easy: I want
      - input files
      - what modules were in the pipeline and their CVS ID
      - their exact parameters
      - for user-generated modules, a path to where I may find the code
      - snacks like ROOT and BRAT version numbers (yes, the module CVS ID is
        redundant if we have the BRAT version, but I'd like to get the info
        _without_ doing a cvs checkout or similar)
    
    Also, I require this to be in a human-readable format for easy checking.
    Just emitting parseable code like you suggest will work (I'm assuming
    that everyone who uses brat is C++ literate...), so long as we include
    ALL relevant settable parameters of both the module and its parents.
    What 'relevant' means must be decided by the module author.
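    
    To illustrate, here is a minimal sketch of how that could look in code
    (SaveSnapshot is a hypothetical method name, and the data members mirror
    the setters in the script excerpt quoted further down):
    
        #include <ostream>
    
        // Sketch only: each module writes its own settable parameters
        // after asking its parent to do the same, so the snapshot covers
        // the whole inheritance chain.
        void BrTileRdoModule::SaveSnapshot(std::ostream& f) const
        {
          BrMultRdoModule::SaveSnapshot(f);  // parent's parameters first
          f << "  SingleAdcLow:  " << fSingleAdcLow  << std::endl;
          f << "  SingleAdcHigh: " << fSingleAdcHigh << std::endl;
          f << "  SingleAdcComp: " << fSingleAdcComp << std::endl;
        }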
    
    Thirdly, I'd like to be able to either just read in the snapshot file
    and rerun the analysis (think lost data disks...) or, which I think is
    more useful and relevant, regenerate a bratmain script that I can then
    modify as needed and THEN rerun the analysis. All in the interest of
    making life easy for the analysis people (i.e. myself and others like
    me...)
    
    * Where do you want to store the information?
    
    Primary place: With the data files. Period.
    
    Secondly: In the dir. where I run the analysis, since this is on my home
    disk and presumably well backed up. (Remember, b.t.w., that not all data
    is on an RCF disk. It should of course move there in the end, but for
    analysis in progress it is very nice to utilize the 60 GB disk on my
    office computer)
    
    Thirdly: Once I have a set of files that I think may be useful for a
    while, I personally would put them in CVS.
    
    >     To be quite honest, I think these plain text ASCII files are a
    >     pain in the behind that no-one in their right mind would like to
    >     manage.
    
    I disagree. Or maybe I'm just not in my right mind, but you are
    unfortunately stuck with me ;-)
    
    >     Also, you need to make special parsers to (re)make the
    >     configuration scripts and command line options.
    
    Aye. That's easy.
    
    >     It is just so much simpler to make sure your configuration script
    >     is around, and make sure that you record the history of that
    >     script - i.e., put it in a CVS repository.  This has so many
    >     advantages over the proposed job summaries, not to mention that we
    >     don't need to write any code, that I don't see the point of the
    >     proposal.
    
    To summarize: config scripts don't contain all that we need, and I don't
    want to grope around in people's private parts (i.e. whatever CVS dir)
    to find out what's in a file.
    
    Now for some of your specific comments:
    
    > Argh! What's so difficult in writing
    >
    >   f << setw(8) << fIntegerMember << std::endl;
    >
    > compared to
    >
    >   f << PrintIntMember(fIntegerMember) << std::endl;
    >
    > The additional cluttering of the base class isn't worth the gain.
    
    Agreed. The whole point here was pretty-printing for easy parsing or
    script regeneration, and this can be done without the added classes. I
    also had an idea originally that one might want to overload these
    functions, but I've reconsidered.
    
    > This format isn't particularly parser-friendly.
    
    Yup. The precise format isn't well thought through - I wanted some
    comments on the general framework and what should be included first.
    
    > Instead, we should
    > mimic the `TObject::SavePrimitive' member function - that is save
    > proper C++ statements:
    >
    >   //__________________________________________________________________
    >   // Module: BrTileRdoModule
    >   BrTileRdoModule* tileRdoModule =
    >     new BrTileRdoModule("MultTile", "TMA reducer");
    >   tileRdoModule->SetAdcGapLimit(4096);
    >   tileRdoModule->SetSaturationChannel(30000);
    >   tileRdoModule->SetThresholdFactor(5);
    >   tileRdoModule->SetOutlierMethod(BrMultRdoModule::kNoCorrection);
    >   tileRdoModule->SetSingleAdcLow(0);
    >   tileRdoModule->SetSingleAdcHigh(500);
    >   tileRdoModule->SetSingleAdcComp(10);
    >   mainModule->AddModule(tileRdoModule);
    >   ...
    >
    > In this way, the log file is directly `executable' via `bratmain'.
    
    Sounds good, at the cost of some readability.
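    
    For the record, a rough sketch of how a module might write itself out
    in this style (mimicking TObject::SavePrimitive; the method signature
    and member names here are illustrative, not existing BRAT code):
    
        #include <ostream>
    
        void BrTileRdoModule::SavePrimitive(std::ostream& out, Option_t*)
        {
          // Emit executable C++ that recreates this module instance
          out << "  BrTileRdoModule* tileRdoModule =\n"
              << "    new BrTileRdoModule(\"" << GetName() << "\", \""
              << GetTitle() << "\");\n";
          out << "  tileRdoModule->SetAdcGapLimit(" << fAdcGapLimit
              << ");\n";
          // ... one such line per settable parameter ...
          out << "  mainModule->AddModule(tileRdoModule);\n";
        }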
    
    > > Also note that two of the parameters occur twice - that's because
    > > BrTileRdoModule has two setters that are identical to its parent class
    > > BrMultRdoModule. This framework also encourages cleaner coding ;-)
    >
    > The particular code you refer to _is_ clean.  If you check the usage
    > of the particular data members, you'll see the point of the overloaded
    > functions: provide specific defaults for TMA/SMA.  The problem is
    > rather in the code you sent - the Br{Tile,Si}RdoModule shouldn't emit
    > the setting of `fSaturationChannel' etc., as it is BrMultRdoModule
    > that owns those parameters, not Br{Tile,Si}RdoModule.   I guess you're
    > confusing data members and member functions, and which one defines
    > `ownership'.
    >
    > In the future, please check your code before pinning `unclean coding'
    > on others' code, thank you.
    
    Touchy, eh? ;-) I didn't actually mean that this code isn't clean, and
    yes, I do see why the extra setters are there. What I do mean is that
    this framework makes the parameter-setting methods, and the sheer number
    of setters in a module, very visible. Many strange setters and
    parameters (and yes, we do have some instances of that...) = a strange
    logfile. So don't take offense, please.
    
    <snip a lot that I answered in the general part above>
    
    >   You really only need to know what configuration script was used,
    >   its version number (remember, BrMainModule has fields for setting
    >   various version information, as well as author information, and
    >   of course a short description) or CVS revision number, the command
    >   line options passed to `bratmain', the ROOT and BRAT version.
    >
    >   Given these pieces of information, you can easily redo a pass:
    
    Aye, but again I don't get quick-and-easy information on e.g. "was this
    file made before or after I made my Important Change[TM] to
    BrQgpDetector?" Not unless you memorize the BRAT CVS history... I _can_
    get it, but it takes a cvs checkout.
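    
    (For the module CVS ID, standard CVS keyword expansion would do the
    trick without any checkout - something like this in each module's
    source file, with the variable name being illustrative:
    
        // Expanded by CVS to file name, revision and date on checkout,
        // so the snapshot can print it at run time without cvs access.
        static const char* kBrTileRdoModuleCvsId = "$Id$";
    
    The snapshot code then just prints this string.)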
    
    >   This of course requires a certain amount of discipline when changing
    >   the official scripts in BRAT (and they _must_ be in BRAT), namely:
    
    Aye, for all official scripts. But we do a lot of work on the way to the
    official scripts, and that work is just as important to document. (I
    think I'm starting to repeat myself, so I'll stop soon...)
    
    >   For calibration constants, and the like, we should record in the
    >   output file:
    >
    >     - the name of the detector,
    >     - the name of the parameter,
    >     - the revision number of the parameter,
    >     - ... other stuff.
    
    This is a good point, even if you don't like my ideas. Calibrations need
    to be included somehow in the snapshot.
    
    >   Again, this should really be part of the output file rather than a
    >   log or job summary, as it is an integral part of the data.
    
    Maybe that too, but I _still_ want to be able to see it without opening
    the data file itself! ASCII files actually do have a certain charm.
    
    >     Perhaps we should consider making BrModule and it's descendants
    >     persistent, so that one can simply write the module pipeline
    >     directly to the file at the end of the job.  Then all the
    >     information worth knowing on the modules is right there.  Of
    >     course we then need to go through every module and change the
    >     class version to something greater than 1, and make sure that
    >     caches (arrays used for intermediate storage, file pointers,
    >     histograms, <shudder>ntuples</shudder>, <shudder>trees</shudder>,
    >     etc. and ladida) are not written to disk.  Again, this is simple,
    >     efficient, and we don't need to write too much code (almost
    >     nothing in fact).
    
    This is definitely a possibility, but it still doesn't cover the need for
    quick-and-easy access to the data. Ever opened a root file over the line
    from Europe?
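    
    For reference, the ROOT mechanics here are standard even though
    BrExampleModule below is made up: bump the ClassDef version above 1,
    and mark the caches with ROOT's "//!" transient comment so they are
    not streamed to disk:
    
        #include "TH1F.h"
    
        class BrExampleModule : public BrModule {
        private:
          Int_t    fThreshold; // settable parameter - written to file
          TH1F*    fDebugHist; //! cache - transient, NOT written to disk
          Float_t* fWorkArray; //! intermediate storage - transient
        public:
          ClassDef(BrExampleModule, 2) // version > 1: members streamed
        };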
    
    <snip nice UML diagram, which is absolutely correct>
    
    >   You probably think that `Ah, here he goes again, talking about ``The
    >   Right Way'', and that we need a DB, and ladida, but he forgets that
    >   someone needs to write the code, setup the DB, and test it, and
    >   ... - and in all, not so easy after all - lets do a Quick'n'Dirty
    >   hack [ed: Also known as ``Quirky'n'Defunct''] and be done with it'.
    
    As I only said this once: I don't want to require a database. Any
    database. I agree that having all this in a db is both nice and on some
    points essential, but this is actually quite a separate issue from what
    I'm proposing. Database storage and BrPassInfoManager usage should be
    implemented (at once!) for the major data passes, dst production etc.,
    but would be a pain for me when e.g. working on lambdas - at least at
    this stage of the analysis. A db would make my snapshots redundant, but
    in some circles redundancy is actually considered a good thing.
    
    >   Well, there's good news for you:  The code is already written, the
    >   database is set up, and it needs only a little testing!
    >
    >   The code is in `BrPassInfoManager' by Kris, and the database I
    >   believe is called `BrahmsPass' on rcas0005.  Ok, so there are a few
    >   things that aren't done altogether yet, but it isn't much, and it
    >   _is_ The Right Way(tm) :-)
    
    And a nice way, too. Could you or Kris write some documentation so that we
    can all know about it?
    
    Well, I guess this whole email should have had a <rant> on it. To
    conclude, Christian, I really believe that this framework will make
    people's lives easier. If we actually use it, that is, which I think is
    a distinct possibility since all the people I've discussed it with have
    been very positive. I agree that for certain major parts of our analysis
    we should in addition have all this in a database. But I also know that
    the only way I will be able to tell, next year, what the specific
    data-thingy that David made on his local NBI computer and migrated to
    RCF before he handed in his thesis and left for industry actually
    contains, is if there is an easy-to-use documentation scheme out there
    and the documentation it produces is easy to read.
    
    Therefore, I still want to implement these changes ;-)
    
    --
    Bjorn H. Samset                           Phone: 22856465/92051998
    PhD student, heavy ion physics            Adr:   Schouterrassen 6
    Inst. of Physics, University of Oslo             0573 Oslo
                                  \|/
    ----------------------------> -*- <-----------------------------
                                  /|\
    

