Back at Linuxworld Boston, Michael mentioned a teensy performance problem with an internal spreadsheet (sorry, it's confidential and can't be posted).
This partially autogenerated 50M xls monster has been chock full of useful compatibility tests for OOo. Unfortunately, one of my recent patches was forcing the pivot tables to regenerate on load, rather than only on demand later, and drove load time up into the 3-hour range. MS Excel could load it in under a minute.
The first step was to throw speedprof (properly patched for OOo) at it. Why not use a sexier tool like cachegrind or oprofile, you wonder? The short answer is simplicity. For a rough first cut speedprof is good enough and doesn't have much time/space overhead. The result showed a hotspot in the xls importer itself, with lots of code of the form:
    long nCount = aMemberList.Count();
    for (i=0; i<nCount; i++) {
        const ScDPSaveMember* pMember = (const ScDPSaveMember*)aMemberList.GetObject(i);
        if ( pMember->GetName() == rName )
            return pMember;
    }
A quick check showed that 'aMemberList' really was a list. Once I'd bandaged my forehead and checked the monitor for damage, the first patch was obvious. This code was wrong on several levels. Let's count the complexity.
1) List::Count : Why not just iterate on the list directly and save the lookup?
2) List::GetObject(i) : Again, why start from the beginning of the list each time when you just want to iterate through each element?
3) if (pMember->GetName() == rName) : Why look things up in order when what you want is to look them up by name?
The first patch was conceptually simple: put a hash in place of that list. It took some spelunking into the data structures to make that possible, but in the end Patch1 brought us down to 45 minutes without bloating the memory usage much.
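A minimal sketch of the idea, not the actual patch. `Member` and `MemberIndex` are hypothetical stand-ins for the real OOo classes, and `std::unordered_map` is an assumption standing in for whatever hash container the patch really used. The point is only that a lookup by name becomes one hash probe instead of a walk over the list.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for ScDPSaveMember; only the name matters here.
struct Member {
    std::string name;
    explicit Member(std::string n) : name(std::move(n)) {}
};

// Keeps members in insertion order and also indexes them by name.
class MemberIndex {
public:
    // Returns the existing member with this name, or creates a new one.
    Member* getOrCreate(const std::string& name) {
        auto it = byName_.find(name);
        if (it != byName_.end())
            return it->second;
        members_.push_back(std::make_unique<Member>(name));
        Member* p = members_.back().get();
        byName_.emplace(name, p);
        return p;
    }

    // Returns nullptr when no member has this name.
    const Member* find(const std::string& name) const {
        auto it = byName_.find(name);
        return it == byName_.end() ? nullptr : it->second;
    }

    std::size_t size() const { return members_.size(); }

private:
    std::vector<std::unique_ptr<Member>> members_;     // owns members, keeps order
    std::unordered_map<std::string, Member*> byName_;  // name -> member
};
```

Calling `getOrCreate("alpha")` twice hands back the same pointer, and `find("gamma")` on a name that was never added returns `nullptr`. The vector is kept alongside the hash because callers may still need the original ordering.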
The next speedprof run made it look as if the construction of the datapilots was uniformly slow, but a bit of digging showed that one particular pivot table was dominating. It had a field with 30,000 unique strings. The code used idioms similar to the previous block.
    ScDPItemData aMemberData;
    long nCount = aMembers.Count();
    for (long i=0; i<nCount; i++) {
        ScDPResultMember* pMember = aMembers[(USHORT)i];
        target->FillItemData( aMemberData );
        if ( bIsDataLayout || aMemberData->IsNamedItem( target ) )
Thankfully it used an array in place of a list, but it threw an extra object copy into the heart of the loop to keep things comfortably inefficient. One more patch and we were down to 10 minutes. Still not good, but it's an improvement. The next steps will be to see why OOo is using 900M vs 90M for XL (and that's with wine), and to see about using a set of indices for the pivot data, rather than a set of strings.
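The indices-instead-of-strings idea might look roughly like this. It is a sketch under stated assumptions, not OOo code: `StringPool` and `internColumn` are hypothetical names, and for simplicity the pool keeps each string twice (once in the vector, once as a map key), which a real implementation would likely avoid.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical pool: gives each distinct string one small integer id.
class StringPool {
public:
    // Returns the id for s, adding s to the pool the first time it is seen.
    std::size_t intern(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end())
            return it->second;
        std::size_t id = strings_.size();
        strings_.push_back(s);
        ids_.emplace(s, id);
        return id;
    }

    // Maps an id back to its string; throws std::out_of_range for a bad id.
    const std::string& str(std::size_t id) const { return strings_.at(id); }

    std::size_t size() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;                  // id -> string
    std::unordered_map<std::string, std::size_t> ids_;  // string -> id
};

// Converts a column of cell strings into ids; repeated values share one id.
std::vector<std::size_t> internColumn(StringPool& pool,
                                      const std::vector<std::string>& cells) {
    std::vector<std::size_t> out;
    out.reserve(cells.size());
    for (const auto& c : cells)
        out.push_back(pool.intern(c));
    return out;
}
```

Once the pivot data holds ids, comparing two items is an integer compare rather than a string compare, and a field with many repeated values only pays for each distinct string once per pool.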
#
Looks like the OpenFormula initiative is going to restart with the goal of improving the interoperability of spreadsheets (different implementations and versions). The traffic got off to a fast start and we've quickly hit an impasse. How comprehensively should we define the evaluation mechanism for conforming spreadsheets? To my mind, any file format that claims to be portable must calculate the same results with different versions/implementations. Differences need to be explained. Others seem to think that the calculations are just for display, akin to different kerning in a word processor.
Microsoft finds the discussion amusing, and claims that their lovely new Office 12 formats don't have this problem. I can't check that because the schemas come with a GPL-incompatible license. However, I would be very surprised if MS included an appendix listing all standard functions and rigorously defined their behaviours.
#
I've been quiet for too long now. It's time to say hi, re-join the community, and do a bit of spreadsheet blogging. For the last few months I've been working on OOo's spreadsheet. Given the choice between working on OOo and leaving free software I swallowed my pride and made the leap. To paraphrase the late Douglas Adams, "OOo is big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean you may think there's a lot of code in emacs, but that's just peanuts compared to OOo." There's lots of neat stuff in here, and Michael has done some amazing work getting it building mostly painlessly.
Gnumeric is still alive and well. The team is on track to release 1.6.0 (with several nice improvements) along with gnome 2.12 in a few weeks. With luck I'll be able to cross-pollinate the projects.
My current project in OOo is to add support for R1C1 style references. The core of the patch was simple. I was able to lift a blob of Gnumeric code I'd written a few months back and dual-license it for inclusion in OOo. The tricky bit is turning out to be the interface change that propagates the choice of address style.
#
libgoffice: The vacation gave me a bit of time to bite the bullet and start working on pulling this out of gnumeric in earnest. Both kids got sick, and the resulting sleepless exhaustion limited development time, but at least the end is in sight. The remaining elements are
Gnumeric: I had tidied up escher export a few weeks ago to enable chart export to xls. Jon Kare picked that up and has been working on image export, something people have been requesting for a long time. Sitting in my inbox was also some absolutely lovely work by Emmanuel to complete his work on axis mapping (invert, log) for the 1.5d charts (col/bar/line/area). While he was at it he ripped out all the piecewise patching for libart antialiasing fuzziness, and consolidated it into the pixbuf renderer. The results look awesome. Couple that with Kasal's recent gsf-janitor work to polish up the msole exporter and we're looking good for a release. There are still a few win32 porting patches to merge in, but on the whole we should be able to release gnumeric with gnomeoffice-1.2 in conjunction with gnome-2.8.
#
Not sleeping well, so I spent a few hours doing some mindless coding to complete the sax-style xml exporter for gnumeric. I'll make it the default for 1.3.1 to get some testing. There's a huge speed win for large files. Not allocating 4*uncompressed size is apparently helpful.
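For anyone wondering why streaming helps, here is a rough sketch of the sax-style approach. It is illustrative only: `XmlStreamWriter` and `xmlEscape` are hypothetical names, not the gnumeric or libgsf API, and it is C++ rather than gnumeric's C. Each call writes straight to the output, so the only state held in memory is the stack of open element names rather than a tree of the whole document.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Escapes the characters that are special in XML text and attribute values.
std::string xmlEscape(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c; break;
        }
    }
    return out;
}

// Hypothetical minimal streaming writer: nothing is buffered as a tree.
class XmlStreamWriter {
public:
    explicit XmlStreamWriter(std::ostream& out) : out_(out) {}

    // Writes the opening tag and remembers the name for endElement().
    void startElement(const std::string& name) {
        out_ << '<' << name << '>';
        open_.push_back(name);
    }

    // Writes escaped character data inside the current element.
    void text(const std::string& s) { out_ << xmlEscape(s); }

    // Closes the most recently opened element; does nothing if none is open.
    void endElement() {
        if (open_.empty())
            return;
        out_ << "</" << open_.back() << '>';
        open_.pop_back();
    }

private:
    std::ostream& out_;
    std::vector<std::string> open_;  // names of elements still open
};
```

Writing a `Cell` element containing the text `a < b & c` to a `std::ostringstream` yields `<Cell>a &lt; b &amp; c</Cell>`. Memory use depends on nesting depth, not on file size.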
Had an interesting chat with Ryan, Charlie, and Mark to marvel at the existence and virtues of Trelane's work on the new Gnumeric website. It can certainly be polished in spots, but the architecture is clean, and it's a hell of a lot better than the monkey see monkey do crud I've been putting up. It is fantastic to finally have some web knowledgeable folk available to put up a more polished gnome-office website.
#
A quiet day.
Walked through the backlanes to the library with Ryan. It went quickly even though I only carried him part way. The innocence of a two year old with a serious case of the 'Why?'s is an excellent balm for the soul. I don't think we'll tell him about bub. It seems too soon for him to come to grips with the permanence of mortality. A nice copy of Wynken, Blynken, and Nod suits him better just now.
I'd best get back to working on a eulogy. #
Beatrice Gittle Goldberg Sep 23 1922-Jun 1 2004
My grandmother (on my father's side) passed away a few moments ago after
struggling with cancer for several months. I'm very lucky the kids had a
chance to meet her.
#