From: Bjorn H Samset (bjornhs@rcf2.rhic.bnl.gov)
Date: Thu Dec 12 2002 - 12:38:45 EST
On Wed, 11 Dec 2002, Christian Holm Christensen wrote:

Hi Christian. As a student politician I've learned that any proposal to discontinue a matter should be treated first, so I'll start with your comments ;-) (This is a long answer to a long email, so let me just start out by encouraging more people to comment. It's important to discuss this.)

First of all, let me say that I wholeheartedly agree with your wish for a good, clean and working software environment. However, we disagree somewhat on what is acceptable to actually get the work done, and also on what the general BRAHMS way of working is. One of your main arguments is that everything I propose is contained in the scripts, and that we could therefore simply store those in CVS. This, however, is not how we actually use the scripts. To your average BRAHMS data analyst, the framework looks like this:

 * BRAT modules and personal code = what does the work
 * bratmain = the framework to do the work in
 * script = what you use, work on, modify and play with to get the
   reconstruction and bigger data runs out
 * after that, it's personal software, located in brahms_app (together
   with a main version of the scripts)

My point is that we don't have one script per reconstruction! We may (and should!) have one for every major official reco pass, but brat/bratmain is used for so much more. I'll come back to this later on.

Let me, as you ask, specify a bit more what I want out of my suggested framework, and why, using your questions:

* When do you want the information?

First and foremost, I want it when I'm snooping through my own or others' data directories and see a file called data_run007324.root. I want there, next to this file, to be another file called e.g. snp_data_run007324.log, where I can find out

 - what the input files were
 - what was done to them
 - what exactly this file contains

(a mock-up of such a file follows at the end of this section). As it is now this information is not available anywhere, and it is NOT supplied by e.g. putting a copy of the bratmain script in the same directory, since the script takes the input files from the command line/shell script/whatever. (Kris: This answers your comment too. You're absolutely right, I should not remove the IO, and had I not been on my way to the airport I might have thought about it before writing that comment ;-)

Of course, in the ideal world we would have this situation:

 - there is only one set of data files for a given type of analysis
 - they contain all the info they should, and it is correct
 - there is a database that contains all info on how the files were made
 - there is an easy-to-use program to access the db and get this info
   (with an interface like `brGetRecoConditions filename.root' or
   something)
 - the db is accessible from every machine that I might conceivably
   copy the data file onto

However, this is not the case. In addition, we are a large and distributed collaboration that co-uses a central set of disks and CPUs. Lots of double work is currently done because one has no idea what went into a set of reco files, and it is easier to make one's own than to ask the author for details (who may have to dig to find them in the first place).

You'll notice that I'm not talking about the big, all-inclusive official data reductions here - for those we should, in my opinion, have all info (including the script) available in a db with a web interface, so that it is available everywhere (like the calibration page). I'm talking about the work and recos that are done prior to this and/or based on this, which is actually the significant part of our work.
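To make this concrete, here is a purely illustrative mock-up of what such a snapshot file could contain. Everything in it - layout, field names, the user-module path - is invented for the example (the module and its parameters are borrowed from the BrTileRdoModule discussion further down):

  # snp_data_run007324.log  (mock-up only; format up for discussion)
  BRAT version : x.y.z
  ROOT version : x.yy/zz
  Input files  : raw_run007324.root
  Output file  : data_run007324.root
  Pipeline     :
    BrTileRdoModule "MultTile"   (CVS Id: ...)
      AdcGapLimit       = 4096
      SaturationChannel = 30000
      ThresholdFactor   = 5
  User modules : brahms_app/<your_dir>/MyLambdaModule.cxx  (path to code)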
Also, and this is important, I want a framework that lets me treat data and analysis in the same way whether I'm at BNL, in Oslo, at home or on an airplane over the Atlantic. I do NOT want to be tied to any particular database or directory structure, and I want to be able to use my software in the same way whether or not such a database is important for my specific task. This is the only way we are going to get people to actually document their work on-the-fly (which, after all, is what this is all about), and it's the beauty of e.g. bratmain: I can use the bratmain framework anywhere, and for any task that can be put into a BrModule. With my snapshots you'll get a concise log of what you're doing, wherever you are and whatever your task in bratmain, and I believe that - just as with bratmain - people will (in a while) actually use this, and we'll be saved some of this and last year's confusion as to the contents of data files.

* What kind of information do you want?

Easy: I want

 - the input files
 - what modules were in the pipeline, and their CVS Id
 - their exact parameters
 - for user-generated modules, a path to where I may find the code
 - snacks like the ROOT and BRAT version numbers (yes, the module CVS
   Id is redundant if we have the BRAT version, but I'd like to get
   the info _without_ doing a cvs checkout or similar - see the aside
   below on how cheap that is)

Also, I require this to be in a human-readable format for easy checking. Just writing out parseable code like you suggest will work (I'm assuming that everyone who uses BRAT is C++ literate...), so long as we include ALL relevant settable parameters of both the module and its parents. What 'relevant' means must be decided by the module author.

Thirdly, I'd like to be able to either just read in the snapshot file and rerun the analysis (think lost data disks...) or - which I think is more useful and relevant - regenerate a bratmain script that I can then modify as needed and THEN rerun the analysis. All in the interest of making life easy for the analysis people (i.e. myself and others like me...).

* Where do you want to store the information?

Primary place: with the data files. Period. Secondly: in the directory where I run the analysis, since this is on my home disk and presumably well backed up. (Remember, b.t.w., that not all data is on an RCF disk. It should of course move there in the end, but for analysis in progress it is very nice to utilize the 60 GB disk on my office computer.) Thirdly: once I have a set of files that I think may be useful for a while, I personally would put them in CVS.

> To be quite honest, I think these plain text ASCII files are a
> pain in the behind that no-one in their right mind would like to
> manage.

I disagree. Or maybe I'm just not in my right mind, but you are unfortunately stuck with me ;-)

> Also, you need to make special parsers to (re)make the
> configuration scripts and command line options.

Aye. That's easy.

> It is just so much simpler to make sure your configuration script
> is around, and make sure that you record the history of that
> script - i.e., put it in a CVS repository. This has so many
> advantages over the proposed job summaries, not to mention that we
> don't need to write any code, that I don't see the point of the
> proposal.

To summarize: config scripts don't contain all that we need, and I don't want to grope around people's private parts (i.e. whatever CVS dir) to find out what's in a file.
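On the CVS Id point, the standard keyword trick makes a module's revision available at run time without any checkout: CVS expands `$Id$' at commit/checkout time, so the revision string gets compiled into the library itself. A minimal sketch - the static variable and the SaveCvsId helper are hypothetical, not existing BRAT code:

  // In BrTileRdoModule.cxx (sketch). CVS expands the $Id$ keyword,
  // so the revision string ends up compiled into the library and
  // can be printed without touching the repository.
  #include <ostream>

  static const char* gTileRdoModuleCvsId = "$Id$";

  void BrTileRdoModule::SaveCvsId(std::ostream& out) const
  {
      // Hypothetical helper: emit the compiled-in revision into
      // the snapshot file.
      out << "// CVS Id: " << gTileRdoModuleCvsId << std::endl;
  }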
Now for some of your specific comments:

> Argh! What's so difficult in writing
>
>   f << setw(8) << fIntegerMember << std::endl;
>
> compared to
>
>   f << PrintIntMember(fIntegerMember) << std::endl;
>
> The additional cluttering of the base class isn't worth the gain.

Agreed. The whole point here was pretty-printing for easy parsing or script regeneration, and this can be done without the added classes. I also had an idea originally that one might want to overload these functions, but I've reconsidered.

> This format isn't particularly parser-friendly.

Yup. The precise format isn't well thought through - I wanted some comments on the general framework, and on what should be included, first.

> Instead, we should
> mimic the `TObject::SavePrimitive' member function - that is, save
> proper C++ statements:
>
>   //__________________________________________________________________
>   // Module: BrTileRdoModule
>   BrTileRdoModule* tileRdoModule =
>     new BrTileRdoModule("MultTile", "TMA reducer");
>   tileRdoModule->SetAdcGapLimit(4096);
>   tileRdoModule->SetSaturationChannel(30000);
>   tileRdoModule->SetThresholdFactor(5);
>   tileRdoModule->SetOutlierMethod(BrMultRdoModule::kNoCorrection);
>   tileRdoModule->SetSingleAdcLow(0);
>   tileRdoModule->SetSingleAdcHigh(500);
>   tileRdoModule->SetSingleAdcComp(10);
>   mainModule->AddModule(tileRdoModule);
>   ...
>
> In this way, the log file is directly `executable' via `bratmain'.

Sounds good, at the cost of some readability. (A sketch of what this could look like on the module side follows at the end of these comments.)

> > Also note that two of the parameters occur twice - that's because
> > BrTileRdoModule has two setters that are identical to its parent
> > class BrMultRdoModule. This framework also encourages cleaner
> > coding ;-)
>
> The particular code you refer to _is_ clean. If you check the usage
> of the particular data members, you'll see the point of the
> overloaded functions: they provide specific defaults for TMA/SMA.
> The problem is rather in the code you sent - Br{Tile,Si}RdoModule
> shouldn't emit the setting of `fSaturationChannel' etc., as it is
> BrMultRdoModule that owns those parameters, not Br{Tile,Si}RdoModule.
> I guess you're confusing data members and member functions, and
> which one defines `ownership'.
>
> In the future, please check your code before pinning `unclean
> coding' on others' code, thank you.

Touchy, eh? ;-) I didn't actually mean that this code isn't clean, and yes, I do see why the extra setters are there. What I do mean is that this framework makes the methods for setting parameters - and the sheer number of setters in a module - very visible. Many strange setters and parameters (and yes, we do have some instances of that...) = a strange log file. So don't take offense, please.

<snip a lot that I answered in the general part above>

> You really only need to know what configuration script was used,
> its version number (remember, BrMainModule has fields for setting
> various version information, as well as author information, and of
> course a short description) or CVS revision number, the command
> line options passed to `bratmain', and the ROOT and BRAT versions.
>
> Given these pieces of information, you can easily redo a pass:

Aye, but again I don't get quick-and-easy information on e.g. "was this file made before or after I made my Important Change[TM] to BrQgpDetector?" Not unless you memorize the BRAT CVS history... I _can_ get it, but it takes a cvs checkout.

> This of course requires a certain amount of discipline when changing
> the official scripts in BRAT (and they _must_ be in BRAT), namely:

Aye, for all official scripts. But we do a lot of work on the way to the official scripts which is just as important to document. (I think I'm starting to repeat myself, so I'll stop soon...)
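As promised, here is a rough sketch of the module-side code for such SavePrimitive-style output. It is only an illustration of the idea, not actual BRAT code: the data member names are guessed from the setters, SaveSettings is an invented method name, and I assume the module knows its name and title ROOT-style:

  #include <ostream>

  // Sketch: emit this module's configuration as C++ statements that
  // bratmain could execute directly.
  void BrTileRdoModule::SaveSettings(std::ostream& out) const
  {
      out << "  BrTileRdoModule* tileRdoModule =\n"
          << "    new BrTileRdoModule(\"" << GetName() << "\", \""
          << GetTitle() << "\");\n";
      out << "  tileRdoModule->SetAdcGapLimit("
          << fAdcGapLimit << ");\n";
      out << "  tileRdoModule->SetThresholdFactor("
          << fThresholdFactor << ");\n";
      // ... one line per settable parameter; in a real version each
      // class would emit only the parameters it owns and delegate
      // the rest to its parent's SaveSettings ...
      out << "  mainModule->AddModule(tileRdoModule);\n";
  }

Regenerating a runnable bratmain script from the snapshot then amounts to concatenating these blocks for all modules in the pipeline.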
> For calibration constants, and the like, we should record in the
> output file:
>
>   - the name of the detector,
>   - the name of the parameter,
>   - the revision number of the parameter,
>   - ... other stuff.

This is a good point, even if you don't like my ideas. Calibrations need to be included somehow in the snapshot.

> Again, this should really be part of the output file rather than a
> log or job summary, as it is an intricate part of the data.

Maybe that too, but I _still_ want to be able to see it without opening the data file itself! ASCII files actually do have a certain charm.

> Perhaps we should consider making BrModule and its descendants
> persistent, so that one can simply write the module pipeline
> directly to the file at the end of the job. Then all the
> information worth knowing about the modules is right there. Of
> course we then need to go through every module and change the
> class version to something greater than 1, making sure that
> caches (arrays used for intermediate storage, file pointers,
> histograms, <shudder>ntuples</shudder>, <shudder>trees</shudder>,
> etc. and ladida) are not written to disk. Again, this is simple,
> efficient, and we don't need to write too much code (almost
> nothing in fact).

This is definitely a possibility (a sketch of the per-module work follows below), but it still doesn't cover the need for quick and easy access to the data. Ever opened a ROOT file over the line from Europe?

<snip nice UML diagram, which is absolutely correct>

> You probably think that `Ah, here he goes again, talking about ``The
> Right Way'', and that we need a DB, and ladida, but he forgets that
> someone needs to write the code, set up the DB, and test it, and
> ... - and in all, not so easy after all - let's do a Quick'n'Dirty
> hack [ed: Also known as ``Quirky'n'Defunct''] and be done with it'.

Since I apparently only said this once: I don't want to need a database. Any database. I agree that having all this in a db is both nice and on some points essential, but this is actually a quite separate issue from what I'm proposing. Database storage and BrPassInfoManager usage should be implemented (at once!) for the major data passes, DST production etc., but would be a pain for me when e.g. working on lambdas - at least at this stage of the analysis. A db would make my snapshots redundant, but in some circles redundancy is actually considered a good thing.

> Well, there's good news for you: The code is already written, the
> database is set up, and it needs only a little testing!
>
> The code is in `BrPassInfoManager' by Kris, and the database I
> believe is called `BrahmsPass' on rcas0005. Ok, so there are a few
> things that aren't done altogether yet, but it isn't much, and it
> _is_ The Right Way(tm) :-)

And a nice way, too. Could you or Kris write some documentation, so that we can all know about it?
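For reference, the per-module work for the persistency route would look roughly like this in ROOT. The class and its members are invented for the example; the `//!' comment is the standard ROOT marker that tells rootcint not to stream a member:

  #include "TH1F.h"

  // Sketch of a persistent module (hypothetical class).
  class BrExampleRdoModule : public BrModule {
  private:
      Int_t   fThreshold;  // real parameter - streamed to the file
      Float_t fGain;       // real parameter - streamed to the file
      TH1F*   fDebugHist;  //! intermediate cache - the "//!" marker
                           //  keeps it from being written to disk
  public:
      // ... ctors, Event() etc. ...
      ClassDef(BrExampleRdoModule, 2) // version bumped so the
                                      // pipeline can be streamed
  };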
Well, I guess this whole email should have had a <rant> tag on it. To conclude, Christian: I really believe that this framework will make people's lives easier - if we actually use it, that is, which I think is a distinct possibility, since all the people I've discussed it with have been very positive.

I agree that for certain major parts of our analysis we should, in addition, have all this in a database. But I also know that the only way I will be able to tell, next year, what the specific data-thingy that David made on his local NBI computer and migrated to RCF before he handed in his thesis and left for industry actually contains, is if there is an easy-to-use documentation scheme out there, and the documentation it produces is easy to read. Therefore, I still want to implement these changes ;-)

-- 
Bjorn H. Samset                          Phone: 22856465/92051998
PhD student, heavy ion physics           Adr: Schouterrassen 6
Inst. of Physics, University of Oslo          0573 Oslo

              \|/
 ----------------------------> -*- <-----------------------------
              /|\