Federated Search: long story & short morals

Dang Woody Evans.  Dang him to heck.  This post is all his fault.

Here's the whole content of a recent post of his at ISHUSH:

"Convince me, someone, that all federated searching does not suck."

I argued:

Doesn't any particular application of Federated Search suck or not suck depending quite a lot on (among other factors) the needs of the searcher, what resources are searched, and how well the metasearch engine is designed and built?

It seems to me that a blanket "Your favorite technology here sucks" is just too broad to be true. 

(I also made the argument that Book Burro is federated search and is great.  I notice Woody didn't contest that- probably because Book Burro is awesome.)

Woody elaborated on his complaint very succinctly:

The main problem, as I understand it, is that the f.s. uses vocabulary that isn't (and can't be) universal. So you end up with missed articles because the search term subject headings are not in a universally controlled vocabulary.

Woody and I are clearly not disagreeing.  Where I think we differ is that I'm far from ready to give up on federated search as an idea.  What follows is a long explanation why. 

I used to work for a company that managed HR and benefits data for very large clients. One of our clients, for instance, was one of two very large soft drink companies, and its name rhymed with… "Schmepsi."

Now, imagine you run this company, and you have employees across all 50 states, and you want to give them all comparable employment benefits. Here are your problems:

  • Your offices throughout the country use all sorts of different HR systems (PeopleSoft, SAP, a piece-of-poop MS Access database, you name it). 
  • There isn't a single health insurance carrier (or vision, or dental, or flex, etc.) that covers all the markets where you have offices and facilities.  In fact, you can't even limit it to just a few; you actually need to work with dozens of different carriers in pretty much every market in the U.S.  All your carriers need different elements of the data, and each has its own proprietary format for data interchange.
  • While you're at it, you have to track national trends and expenditures, and manage disbursement of payments to carriers.

So you've already got data from a significant number of different sorts of databases chunking out HR data in wildly different export formats using completely different data definitions, and you have to get just the right parts of this data in just the right format to all the right carriers at the right time.  Nightmare, right?

That's where the company I worked for came in.  We checked out their systems and their data exports, learned their business needs and data definitions, then designed and built automated custom import processes to get their data from all their various HR systems into OUR database, using OUR data model.

We also got to know their vendors, the vendors' business needs, and the vendors' systems.  Then we designed and built custom exports that got JUST the data each vendor needed in EXACTLY the format that worked for that vendor's systems.  When this was done well, the results were a wonder to behold.  Automated, centralized, electronic data exchange with impressive accuracy.
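For the fellow data-monkeys, here is a minimal sketch of that import-then-export pattern in Python.  Everything in it is an assumption made up for illustration: the field names, the two source systems, and the two carrier formats are hypothetical, and the real processes were far more involved.

```python
# Hypothetical field names throughout; real HR exports and carrier formats differ.

# Each source system gets its own mapping from its field names to the shared ones.
IMPORT_MAPS = {
    "peoplesoft": {"EMPLID": "employee_id", "LAST_NAME": "last_name", "PLAN_CD": "plan"},
    "access_db": {"EmpNo": "employee_id", "Surname": "last_name", "HealthPlan": "plan"},
}


def import_record(source, raw):
    """Translate one raw record from a source system into the shared data model."""
    mapping = IMPORT_MAPS[source]
    return {ours: str(raw[theirs]).strip() for theirs, ours in mapping.items()}


def export_pipe_delimited(record):
    """Made-up carrier A wants pipe-delimited text: id|LASTNAME|plan."""
    return "|".join([record["employee_id"], record["last_name"].upper(), record["plan"]])


def export_fixed_width(record):
    """Made-up carrier B wants fixed-width columns: 10 chars of id, 20 of name."""
    return record["employee_id"].ljust(10) + record["last_name"].ljust(20)


canonical = [
    import_record("peoplesoft", {"EMPLID": 1001, "LAST_NAME": "Nguyen ", "PLAN_CD": "HMO"}),
    import_record("access_db", {"EmpNo": "A-17", "Surname": "Ortiz", "HealthPlan": "PPO"}),
]

for rec in canonical:
    print(export_pipe_delimited(rec))
```

The point of the design is that each system only ever needs one translation, to or from the shared model, instead of one translation per pair of systems.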

Near the end of my time working for this data management company, certain HIPAA rules came into effect that mandated a number of our feeds (in and/or out) use a particular standard for EDI called the ASC X12N 834 (004010X095).  We affectionately called it "the eight-thirty-four."

(It was while working for this data management company that I learned to think of myself as a data-monkey, and started to learn SQL from some really great programmers.)

In a number of ways, the advent and growth of this EDI standard made the job easier and faster, and it was hoped at the time I left that industry that this increasing move towards standardization would make data interchange easier and less expensive.

What does this have to do with federated search?  Everything.  What's to stop a good, well-designed federated search tool from importing indices from multiple databases (via custom import processes designed and built specifically with the data definition translation needs in mind), and searching the combined indices by the data definitions they now share?  Can we not hope that standards of data definition will be encouraged (/demanded) and eventually implemented to one degree or another by database vendors?
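To make that question a bit more concrete, here is a rough sketch of what I have in mind, assuming two hypothetical vendors whose records use different field names and different subject terms for the same idea.  The vendor names, field maps, and subject crosswalk are all invented for illustration; no real product is being described.

```python
from collections import defaultdict

# Hypothetical per-vendor field maps: vendor field name -> shared field name.
FIELD_MAPS = {
    "vendor_a": {"ti": "title", "su": "subject"},
    "vendor_b": {"Title": "title", "Descriptors": "subject"},
}

# Hypothetical subject crosswalk: each vendor's term -> one shared term.
SUBJECT_CROSSWALK = {
    "vendor_a": {"metasearch": "federated search"},
    "vendor_b": {"cross-database searching": "federated search"},
}


def normalize(vendor, record):
    """Rename fields and translate subject terms into the shared definitions."""
    out = {"source": vendor}
    for theirs, ours in FIELD_MAPS[vendor].items():
        out[ours] = record[theirs]
    crosswalk = SUBJECT_CROSSWALK[vendor]
    out["subject"] = [crosswalk.get(s.lower(), s.lower()) for s in out["subject"]]
    return out


def build_index(records):
    """Map each shared subject term to the records that carry it."""
    index = defaultdict(list)
    for rec in records:
        for term in rec["subject"]:
            index[term].append(rec)
    return index


records = [
    normalize("vendor_a", {"ti": "One Box to Search Them All", "su": ["Metasearch"]}),
    normalize("vendor_b", {"Title": "Searching Across Silos",
                           "Descriptors": ["Cross-database searching"]}),
]
index = build_index(records)

hits = index["federated search"]
print([(h["source"], h["title"]) for h in hits])
```

The hard, expensive part is obviously the crosswalk, not the code.  Somebody has to sit down and decide which of vendor A's terms mean the same thing as which of vendor B's.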

First moral: I'm not saying it is easy.  I'm saying it is do-able.  I'm nowhere near ready to give up on this relatively young idea.

Woody also wrote:

Federated searching in most cases turns out to be sorta like searching Picasa, Flickr, CC, and Wikimedia Commons all at once for a folksonomic tag term and calling the results thorough.

This is a completely fair point.  After all, the data I helped manage at my former job WAS all the same kind of data: HR data.  This leads us to the second moral: Some databases shouldn't and can't be effectively searched in a federated model because their data types are not similar enough or because they have virtually no structure.

Third moral: The result set from any one query cannot be called thorough, even if it is performed in a single database.  If I searched for "library" at Flickr, that wouldn't get me all possible library-related images.  Human judgement would also demand that I search for "libraries", "librarian", and a number of other related words and phrases.

However, using Flickr as an example seems like a bit of an unfair analogy.  Sure, some databases don't define their data well enough, but none are as chaotic as the folksonomies online.  (That'll be the topic of a future post.)
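Going back to the third moral for a second, here is a toy illustration of why one query is not thorough even inside one database.  The photo records and tags are made up, and exact-match tag lookup is my assumption about the simplest possible search, not a description of how Flickr actually works.

```python
# Made-up photo records with free-form tags, standing in for a folksonomy.
PHOTOS = {
    "photo1": {"library", "books"},
    "photo2": {"libraries", "architecture"},
    "photo3": {"librarian", "portrait"},
    "photo4": {"cats"},
}


def search(tag):
    """Return the ids of photos carrying exactly this tag."""
    return {pid for pid, tags in PHOTOS.items() if tag in tags}


def search_any(tags):
    """Union the results of one exact-match search per related term."""
    found = set()
    for tag in tags:
        found |= search(tag)
    return found


single = search("library")
expanded = search_any(["library", "libraries", "librarian"])
print(sorted(single))    # only photo1
print(sorted(expanded))  # photo1, photo2, photo3
```

The list of related terms still has to come from a human (or from a controlled vocabulary), which is really the whole argument in miniature.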

Thank you, Woody, for the discussion.  I HAD written the above as a comment on your blog, but I thought it would be obnoxious to leave a comment of this size. 

Please feel free, of course, to tell me I'm all wrong on all of this.  I'm wrong a lot, after all.

And, folks: Check out ISHUSH.  Woody says stuff that makes me go "hmmmm."

4 thoughts on “Federated Search: long story & short morals”

  1. Quick note on 3rd moral: you’ve still got controls for thoroughness within single databases. Truncation will take care of the variations on words, for instance. The result set from a single database CAN BE thorough, most often IS in fact thorough because of controlled vocabulary.
    Note on 1st point: to argue that a truly good federated search is do-able (and I’m going to blow this out into a universal abstraction, an imaginary federated search mechanism for all databases in existence, in all languages), you also have to argue that a universal controlled vocabulary is do-able. That’s a problem. Will we trust LOC subject headings for translations of Tibetan mystical texts, for instance? No single controlled vocab. system can cover all fields of knowledge or human experience, so no single federated search engine is possible. I realize that this gets into a matter of scale, and I won’t try to argue that a good f.s. of a small number of databases isn’t possible. It obviously is possible, esp. within databases from one vendor with one standard vocab. for subjects. But when you get more than two ‘vendors’ involved with different controls and standards and you try to cram them into one workable system, bugs pour from the seams! In the long run, if a goal might be to, say, create a universal ‘google’ f.s. system for ‘all the world’s knowledge’ or something, more and more and more bugs will breed. So to speak. I think.

  2. Woody,

    A part of the problem I have with your line of reasoning is the very fact that you blow it out into a universal abstraction. It would be downright silly to expect a federated search tool to cross all databases. Federated search is best utilized in databases of similar types of data, when the absence of a controlled vocabulary can be compensated for with skillfully-designed code conversions to negotiate the differences.

    Silly example: If you’ve got three databases of a similar content type from three different vendors (let’s say all three are databases of cooking recipes), I would argue that a good federated search tool could and should be built to simultaneously search all three.

    I think it is a given that federated search is a poor choice of tools when wanting to look in multiple databases from multiple vendors and of disparate content types. I don’t think you’ll find anyone to argue to the contrary.

    A “universal search engine” *is* an impossible goal without a universal controlled vocabulary, but (this is the most important part) federated search <> “universal search engine”, nor should it be.

    Thank you again, Woody, for starting this very enjoyable conversation!

  3. Aye aye.

    Universal abstractions are usually absurd. And so is every single federated search for databases I have yet encountered! The challenge stands: Convince me, someone, that all federated searching does not suck. And I can even alter it a bit to say: Convince me, someone, that ANY federated searching does not suck.

    And thank you.

  4. I maintain: Book Burro is federated search, and is wholly without suckage. Sans suckitude, as it were.