Wed, 14 Jul 2004
I get a lot of messages asking me to compare and contrast Storage, WinFS, and sometimes Dashboard and Medusa. More recently, I've gotten a lot of questions about Spotlight and Beagle. I've generally avoided commenting (which usually means not answering the e-mail...) on these things, both because it's impossible for me to do an unbiased comparison and because the goals seem to be quite different.
- Medusa, Beagle & Spotlight are similar, though of course Spotlight is much more mature. I would call them metadata index systems.
- Storage & WinFS are similar, though of course WinFS is much more mature. I would call them document stores.
Caveat: If indexing and search were the primary goals, a document store would be a ridiculously overengineered approach. The Medusa/Beagle/Spotlight model is much more sane if this is your only or primary goal. I'm not saying this to suggest document stores are better or worse than metadata indexing systems, only to point out that there's an element of apples-to-oranges comparison at work here.
Metadata Index Systems
Medusa:
Medusa was originally written by Eazel, integrated tightly with Nautilus 1.0, and slated for inclusion in the GNOME 1.4 release. It was primarily written by Rebecca Schulman, but also had major contributions from Maciej Stachowiak and some from me. Medusa ran as root, which worried some people (but of course, so does updatedb for slocate...). Unfortunately, it also had a major bug that caused it to be pulled from GNOME 1.4 at the last minute. Rebecca fixed the bug after the release and re-architected Medusa to run as a normal user, but Eazel collapsed before GNOME 2.0 and nobody promoted its inclusion. Curtis Hovey and I later ported it to the GNOME 2.x platform, and Curtis is currently maintaining it and adding lots of new features and fixes. In particular, he seems to be working on a UI for it. Medusa allowed very fast searches over large indexes. Indexes were built by scanning the disk every night (like slocate, and unlike Spotlight, which does things better). It also provided a search: URI scheme that allowed creation of dynamic "search folders". So you could have a "Spreadsheets" folder, for example, that always contained any spreadsheets on your system. The biggest hurdle for Medusa today is that the set of indexers is not very extensible, so it doesn't know how to index very many different file types.
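To make the "scan the disk every night" model concrete, here is a minimal sketch of a crawler that reports only files that are new or changed since the previous pass. It is not Medusa's code; the `crawl` function and the `seen` dictionary are hypothetical names invented for this illustration.

```python
import os
import tempfile

def crawl(root, seen):
    """Return paths under `root` that are new or modified since the last crawl.

    `seen` maps path -> last observed modification time (in nanoseconds)
    and is updated in place. Both names are hypothetical.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime_ns
            except OSError:
                continue  # the file vanished while we were crawling
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return sorted(changed)

# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "notes.txt")
    with open(path, "w") as f:
        f.write("hello")
    seen = {}
    first = crawl(root, seen)   # the new file is reported
    second = crawl(root, seen)  # nothing changed, so nothing is reported
```

The weakness of this model is visible in the sketch: anything that changes between two crawls stays stale in the index until the next pass.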
Spotlight:
Of course I haven't looked at Spotlight's code or used it, so what I know about it comes from what Apple has published and from discussions with friends at Apple. Spotlight appears to be a sophisticated, well-implemented approach to building a metadata layer on top of an existing file system. Changes to files appear to be noticed at the kernel layer, and indexers are quickly run to update the metadata cache (with information about filename, album name, size, file contents, keywords, etc). I don't know whether it is guaranteed that indexers will be run before the data can be accessed, but it is supposed to happen very quickly in any case, so it appears instant to the user. Spotlight is the work of (among others; there are probably more people I just don't know) Pavel Cisler (BeOS Tracker & Eazel Nautilus) & Dominic Giampaolo (BeOS BFS, which had a similarly sophisticated metadata system). A lot of work has also gone into Spotlight's UI for grouping, measuring relevance, etc. It's easy to underestimate how much work this is; in some ways the "indexing" is the easy part. Spotlight appears to index a lot more than just the filesystem, including things like calendar and mail, but I don't know the full extent of what it can do.
Beagle:
My knowledge of Beagle is based on playing with it and reading through a fair bit of the code, but I could definitely be missing large aspects because I haven't talked with Jon. Beagle's code appears to be fairly immature at the moment, but I would expect it to grow. It uses a port of Apache Jakarta's Lucene. Lucene primarily provides a way to *store* indexed metadata and do fast *searches* over lots of metadata (including full text, of course), but it doesn't provide the indexers for specific file types. In some sense, Lucene is a specialized "database" for storing the results of indexers. Currently Beagle has indexers for HTML, JPEG, MP3, OpenOffice.org (very cool) and text. Unlike Medusa (I have no idea about Spotlight for this), Beagle is designed to index "byte streams" rather than files, so it can index, e.g., "the current page you are looking at in Epiphany". This makes it very compatible with Dashboard, since Dashboard wants to index any and all contextual data, not just things on the hard disk. At the moment Beagle appears to contain only a very simple UI, so it's primarily a document indexing system.
On the filesystem side, Beagle currently works like Medusa and requires a "crawler" to update its metadata cache (say, nightly), vs. Spotlight, which updates instantly. Beagle also has crawlers for mail and IM logs, and it includes a renderer system for displaying the relevant metadata etc. for different file type results. AFAIK, Jon Trowbridge at Novell is the person mainly hacking on Beagle atm, but I think the code was refactored out of Dashboard, and a number of other contributors are listed.
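The split described above, per-type indexers on one side and a store that only holds and searches their output on the other, can be sketched roughly as follows. This is not Beagle's or Lucene's API; `TinyIndex`, `INDEXERS`, and the two indexer functions are hypothetical stand-ins, and the sources are arbitrary byte streams rather than files.

```python
import re
from collections import defaultdict

# Hypothetical per-type indexers: each turns a byte stream into indexable text.
def index_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")

def index_html(data: bytes) -> str:
    # Crude tag stripping, for illustration only.
    return re.sub(r"<[^>]+>", " ", data.decode("utf-8", errors="replace"))

INDEXERS = {"text/plain": index_text, "text/html": index_html}

class TinyIndex:
    """Stands in for the index store: maps terms to source identifiers."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, source_id: str, mime_type: str, data: bytes) -> None:
        text = INDEXERS[mime_type](data)
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(source_id)

    def search(self, query: str) -> set:
        # Return the sources that contain every term in the query.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        results = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            results &= self._postings.get(term, set())
        return results

index = TinyIndex()
index.add("file:///home/me/notes.txt", "text/plain", b"Lucene stores indexed metadata")
index.add("epiphany:current-page", "text/html", b"<p>Beagle indexes <b>byte streams</b></p>")
```

Because `add` takes bytes plus an identifier, the second source does not need to exist on disk at all, which is the property that makes this design a good fit for Dashboard.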
Document Stores
Both WinFS & Storage are aimed at doing a lot more than document indexing... in many ways document indexing is only a nice side effect of their larger aims. Storage and (AFAICT) to a lesser extent WinFS both intend to store the actual documents themselves inside the store. That means that more than just metadata is inside the store. Both WinFS & Storage provide a query system, though WinFS has developed a nice object-oriented language (which I think they compile to SQL), whereas Storage currently uses straight SQL, which is harder for other developers to use.
Storage:
I know the most about this, so I'll talk about it the most, of course ;-) Storage is fairly immature, and the architecture has shifted a lot in the past few months.
"storage-store" provides a DBus service that allows fetching objects over the FreeDesktop DBus, getting their attributes, relating them to each other, running queries, etc. "storage-store" uses PostgreSQL to store the structured objects and perform queries. Because objects are accessed "live" rather than as "buffers", changes are instantly propagated across the bus, so multiple applications or users can work on the same document and instantly see changes other people make.
I'm currently working on architecture to tie storage-store into standard IM presence information, so you will be able to see buddy icons of other people and what part of the document they are working on inside Storage applications. I have a lot of user experience goals for Storage (or more accurately, for the applications and desktop that use Storage). You can find information about most of them on my blog and at the Storage homepage. Though these goals are more important to me than document indexing, and have a lot more impact on Storage's architecture as a result, I will focus on document indexing in order to compare and contrast with the other systems.
libstorage-translators provides a framework for translators that can take structured object data in the store (metadata and the actual data itself) and translate it to and from byte streams (such as files). The goal is not indexing files, but providing a way to move files in and out of the store. So for example, if your friend sent you a PDF file by e-mail, you could drag that file into your local store and libstorage-translators would automatically decompose the information for placing in the store (and of course extract lots of metadata like album name, description, image width, etc. in the process). Currently I have only worked on the "importer" side of translators, not the "exporter", so they are effectively like indexers. There are currently importers for: DocBook, HTML, any image format supported by gdk-pixbuf (JPEG, PNG, BMP, GIF, and several more obscure formats), PDF, text, and any format supported by GStreamer (MP3, OGG, AVI, MPEG2, etc). Importers can also create thumbnails for the data for convenient display later. Storage also includes a renderer system for displaying the relevant metadata etc. for different sorts of results to a query. A major drawback is that I don't have translators for common document formats like Gnumeric or OO.o at the moment.
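As a rough illustration of the importer half of a translator framework, the sketch below registers importers by MIME type and turns a byte stream into a structured object carrying both metadata and the data itself. None of these names come from libstorage-translators; `StoreObject`, `IMPORTERS`, `importer`, and `import_stream` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class StoreObject:
    """Hypothetical structured object: metadata plus the content itself."""
    kind: str
    attributes: Dict[str, object] = field(default_factory=dict)
    body: str = ""

# Hypothetical registry mapping a MIME type to its importer.
IMPORTERS: Dict[str, Callable[[bytes], StoreObject]] = {}

def importer(mime_type: str):
    """Decorator that registers an importer for one MIME type."""
    def register(func):
        IMPORTERS[mime_type] = func
        return func
    return register

@importer("text/plain")
def import_text(data: bytes) -> StoreObject:
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    return StoreObject(
        kind="text-document",
        attributes={
            "title": lines[0] if lines else "",
            "word_count": len(text.split()),
        },
        body=text,
    )

def import_stream(mime_type: str, data: bytes) -> StoreObject:
    """Decompose a byte stream into a store object using the matching importer."""
    try:
        translate = IMPORTERS[mime_type]
    except KeyError:
        raise ValueError(f"no importer registered for {mime_type}") from None
    return translate(data)

obj = import_stream("text/plain", b"Shopping list\nmilk eggs bread")
```

An exporter would be the inverse mapping, from a `StoreObject` back to bytes; with only the importer side in place, the framework behaves much like a set of indexers.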
Queries can either be performed using an SQL-like format (slightly higher level than SQL but not much, it gets translated to SQL) or using natural language queries. A large chunk of storage code is currently in its NL system which uses very sophisticated HPSG grammars and other techniques to translate human language phrases into the SQL query format.
A storage:/// VFS URI is provided which automatically invokes translators when files are dragged into the store. That means you can, e.g., open a Nautilus window to storage:/// and drag files in to add them to the store. It also provides query folders like Medusa. So for example you can have a folder "spreadsheets" or "songs by John Lennon that don't have the word 'love' in them" that is live updated to contain objects matching those criteria.
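A query folder is essentially a saved query whose listing is whatever currently matches. The sketch below shows the idea with the John Lennon example, using SQLite from the Python standard library as a stand-in for PostgreSQL; the `songs` table, its columns, and the `QueryFolder` class are assumptions made up for the example, not Storage's schema or API.

```python
import sqlite3

class QueryFolder:
    """A named folder whose contents are the current results of a stored query."""

    def __init__(self, conn, name, where_clause, params=()):
        self.conn = conn
        self.name = name
        self.where_clause = where_clause  # trusted SQL fragment, not user input
        self.params = params

    def list(self):
        # Re-run the query on every listing so the folder reflects the store as it is now.
        sql = f"SELECT title FROM songs WHERE {self.where_clause} ORDER BY title"
        return [row[0] for row in self.conn.execute(sql, self.params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (title TEXT, artist TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [("Imagine", "John Lennon"), ("Love", "John Lennon"),
     ("Real Love", "John Lennon"), ("Yesterday", "The Beatles")],
)

folder = QueryFolder(conn, "Lennon without love",
                     "artist = ? AND title NOT LIKE ?", ("John Lennon", "%love%"))
before = folder.list()
conn.execute("INSERT INTO songs VALUES (?, ?)", ("Jealous Guy", "John Lennon"))
after = folder.list()
```

This version only refreshes when the folder is listed. A live-updated folder would additionally need the store to notify it of changes, which the sketch does not attempt.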
WinFS:
I know the least about WinFS of any of the systems discussed here. I need to read up on it more... but the last time I looked at it heavily was more than a year ago, when MS was still very elusive. It looks like a lot of info is up on the web now, so what I'm saying could be out of date. WinFS is backed by both NTFS & Microsoft's SQL Server. It provides a very nice API for querying and working with objects. Currently the set of object types it can use is fixed and predefined by MS (but the list is long). In the future they will probably open this up and allow anyone to define new object types. AFAICT, WinFS is currently targeting primarily the storage of metadata, though it is tightly coupled to the files themselves stored as byte streams in NTFS. It does look like in the future they intend to more completely store things in WinFS. WinFS provides a very cool set of hooks for performing actions in response to changes in the store. WinFS uses this to provide indexing services, but users can also define their own actions (e.g. you could say, "whenever an e-mail from George is created, copy it into my 'to burn' directory").
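The "hooks on store changes" idea can be sketched generically as a store that runs registered actions whenever a matching object is created. This is only an illustration of the pattern and says nothing about how WinFS actually exposes it; the `Store` class, `on_create`, and the dictionary-shaped objects are hypothetical.

```python
class Store:
    """Hypothetical object store that fires user-defined hooks on creation."""

    def __init__(self):
        self.objects = []
        self._hooks = []  # list of (predicate, action) pairs

    def on_create(self, predicate, action):
        """Run `action(obj)` for every newly created object where `predicate(obj)` is true."""
        self._hooks.append((predicate, action))

    def create(self, obj):
        self.objects.append(obj)
        for predicate, action in self._hooks:
            if predicate(obj):
                action(obj)

# "Whenever an e-mail from George is created, copy it into my 'to burn' directory."
to_burn = []
store = Store()
store.on_create(
    lambda o: o.get("type") == "email" and o.get("from") == "George",
    to_burn.append,
)

store.create({"type": "email", "from": "George", "subject": "Photos"})
store.create({"type": "email", "from": "Alice", "subject": "Lunch"})
```

An indexing service fits the same shape: it is just another hook whose action updates an index instead of copying the object.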