DST framework (was Re: follow up of pid...)

From: Christian Holm Christensen (cholm@hehi03.nbi.dk)
Date: Wed Nov 20 2002 - 10:00:43 EST

    Hi, 
    
    Some comments on design of `DST' analysis framework.  These comments
    are only dealing with software issues.  
    
    The most important issue in this discussion is the design of the data
    structures.  I cannot stress that point enough.  If we have good data
    structures half the job is done.  Hence, I suggest you start thinking
    about data structures, write-up a specification (preferably in UML)
    and post it on the web for general discussion. Do _not_ start coding
    until a design has been agreed upon. 
    
    A few points I'd like to raise concerning the data structures: 
    
    * Each data structure must have a singular purpose.   That is, a
      single data structure should not be used to store many different
      kinds of data.
    
    * Each data structure must be optimised for storage.  That is, no
      redundant data members in the classes, and the smallest adequate
      data type must be used.  For example, if one has a number that will
      only take values in the range -32768 to 32767 (a PID, for example),
      a simple `Short_t' will do - an `Int_t' is overkill (not to
      mention a `Float_t').
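
      The savings are easy to quantify.  A ROOT-free sketch (ROOT
      defines these typedefs in `Rtypes.h'; plain fixed-width integers
      stand in for them here so the example compiles without ROOT):

```cpp
#include <cstdint>
#include <cstddef>

// Sketch: stand-ins for the typedefs ROOT provides in Rtypes.h.
typedef int16_t Short_t;   // 2 bytes: enough for a PID in [-32768, 32767]
typedef int32_t Int_t;     // 4 bytes
typedef float   Float_t;   // 4 bytes

// Bytes saved per entry by storing the PID as a Short_t instead of an Int_t.
inline std::size_t pidBytesSaved() { return sizeof(Int_t) - sizeof(Short_t); }
```

      Over, say, 10^8 stored tracks those two bytes per entry amount to
      roughly 200 MB on disk for this one member, before compression.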
    
    * Data structures _must_ be of fixed size.  That is, no memory
      allocation may be done in the data class (which means no
      pointers!).  This is so that `TTree' may be fully exploited.
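
      A minimal sketch of such a fixed-size data class (the name
      `DstTrack' and the maximum hit count are hypothetical): every
      member has a compile-time size, so the whole object is one
      contiguous block that `TTree' can lay out in branches without
      chasing pointers.

```cpp
#include <cstdint>

// Hypothetical DST track entry: one contiguous, fixed-size block.
struct DstTrack {
  int16_t fPid;         // particle ID fits in 16 bits (a Short_t)
  float   fPx, fPy, fPz;
  // A variable-length need is met by a fixed-capacity array plus a
  // count -- NOT by a pointer into heap memory.
  int16_t fNHits;
  int16_t fHitId[16];   // assumed maximum of 16 hits per track
};

// No pointers anywhere, so the size is known at compile time.
static_assert(sizeof(DstTrack) <= 64, "DstTrack stays small and fixed-size");
```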
    
    * If a data structure needs to refer to other data structures, it
      should do so via either a `TRef' or a `TRefArray' data member.
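
      A `TRef' resolves the referred-to object through its unique ID
      rather than through a raw pointer, which keeps the data class
      persistable.  A ROOT-free sketch of the same idea, using indices
      into the event's hit collection (all names hypothetical):

```cpp
#include <cstdint>
#include <vector>

struct DstHit { float fX, fY, fZ; };

struct DstTrack {
  int16_t fNHits = 0;
  int16_t fHitIdx[16] = {};  // indices into the event's hit list,
                             // playing the role a TRefArray plays in ROOT
};

// Resolving a reference, analogous to TRef::GetObject() in ROOT.
inline const DstHit& hitOf(const std::vector<DstHit>& hits,
                           const DstTrack& t, int i) {
  return hits[t.fHitIdx[i]];
}
```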
    
    * Use `TTree' and not `TNtuple'.  A `TNtuple' can only store
      `Float_t' values, and that's just not flexible enough, especially
      if we need to use the same data structures for Au+Au, Au+d, and
      p+p.
    
    * One must be wary of virtual member functions in the data
      structures.   Virtual function calls are expensive, and derived
      classes should be kept to an absolute minimum.
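
      The cost shows up even before any call is made: one virtual
      function forces a vtable pointer into every stored object.  A
      small illustration (class names hypothetical):

```cpp
// Two otherwise identical data classes; the second pays for a vtable
// pointer in every instance (typically 4-8 bytes plus padding), on top
// of the indirect-call overhead at analysis time.
struct PlainHit   { float fX, fY, fZ; };
struct VirtualHit { virtual ~VirtualHit() {} float fX, fY, fZ; };

static_assert(sizeof(VirtualHit) > sizeof(PlainHit),
              "the vtable pointer inflates every stored object");
```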
    
    * `Applied cuts' information is as much a part of the resulting data
      set as the physics data, and should be treated on an equal footing
      with it.   That means that `cut information' should be written to
      the output file as data structures rather than to a separate ASCII
      file.  That is easily done, utilising collections and customised
      data structures.
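
      A sketch of such a cut record (the name and members are purely
      illustrative): a plain fixed-size structure that can be written to
      the output file next to the event data, e.g. in its own `TTree'
      or `TList', instead of to a side ASCII file.

```cpp
#include <cstdint>

// Hypothetical record of the cuts a job actually applied.
struct DstCutInfo {
  float   fVtxZMin, fVtxZMax;      // vertex-z window applied, in cm
  float   fMinPt;                  // track pT threshold, in GeV/c
  int32_t fNEventsIn, fNEventsOut; // bookkeeping before/after the cuts
};

// The same structure can both drive the cut and document it.
inline bool passesVtxZ(const DstCutInfo& c, float z) {
  return c.fVtxZMin <= z && z <= c.fVtxZMax;
}
```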
    
    All these considerations should also help speed up the analysis jobs. 
    
    A few comments on the analysis code: 
    
    * Each class (module, task, whatever) may only do _one_ thing.  
    
    * The framework must allow for a high degree of customisation (a la
      `bratmain' and configuration scripts). 
    
    * If multiple loops are needed, then the best way to add information
      is to use tree friends (`TTree::AddFriend').  In that way, one can
      keep all the information without copying, and it's flexible enough
      to facilitate redoing a separate step.
    
    * Each step should be done as a separate job, resulting in new
      output files.   Multiple loops over the data in the same job are a
      waste of time.
    
    * The jobs should cut away as much information as possible as soon as
      possible. 
    
    I do _not_ recommend using the BRAT data structures and modules as the
    basis for a DST framework.  The problem is that BRAT is far too slow,
    due to far too many allocations and deallocations in the code, and
    that it's not really geared towards `TTree's.   Instead, I would
    recommend a scheme along the lines of this [1] package.
    
    Djam had some hiccups with the DB stuff - in particular, he noted that
    the comment field of a calibration revision _must_ contain an
    informative string, or it will be near impossible to figure out what
    happened in the calibration.  I could not agree more whole-heartedly,
    and anyone who does not add informative comments to their revisions
    should be rolled in tar and feathers, put on a railroad track and
    carried into the Atlantic Ocean.  Secondly, I'd like to point out that
    SQL is quite a simple language, but you don't really need to know a
    lot of it.  Instead, use our specialised tool `brdbbrowser',
    available in my CVS area [2].  There are also quite a lot of
    general-purpose MySQL browsers available out there.
    
     ___  |  Christian Holm Christensen 
      |_| |	 -------------------------------------------------------------
        | |	 Address: Sankt Hansgade 23, 1. th.  Phone:  (+45) 35 35 96 91
         _|	          DK-2200 Copenhagen N       Cell:   (+45) 24 61 85 91
        _|	          Denmark                    Office: (+45) 353  25 305
     ____|	 Email:   cholm@nbi.dk               Web:    www.nbi.dk/~cholm
     | |
    
    [1] http://cholm.home.cern.ch/cholm/root/#rootfw
    [2] cvs -d /afs/rhic/brahms/BRAHMS_CVS co \
            -d brdbbrowser brahms_app/cholm_pp/brdbbrowser
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 20 2002 - 10:01:42 EST