Dang Woody Evans. Dang him to heck. This post is all his fault.
Here's the whole content of a recent post of his at ISHUSH:
"Convince me, someone, that all federated searching does not suck."
Doesn't any particular application of Federated Search suck or not suck depending quite a lot on (among other factors) the needs of the searcher, what resources are searched, and how well the metasearch engine is designed and built?
It seems to me that a blanket "Your favorite technology here sucks" is just too broad to be true.
(I also made the argument that Book Burro is federated search and is great. I notice Woody didn't contest that, probably because Book Burro is awesome.)
Woody elaborated on his complaint very succinctly:
The main problem, as I understand it, is that the f.s. uses vocabulary that isn't (and can't be) universal. So you end up with missed articles because the search term subject headings are not in a universally controlled vocabulary.
Woody and I are clearly not disagreeing. Where I think we differ is that I'm far from ready to give up on federated search as an idea. What follows is a long explanation why.
I used to work for a company that managed HR and benefits data for very large clients. One of our clients, for instance, was one of two very large soft drink companies, and its name rhymed with… "Schmepsi."
Now, imagine you run this company, and you have employees across all 50 states, and you want to give them all comparable employment benefits. Here are your problems:
- Your offices throughout the country use all sorts of different HR systems (PeopleSoft, SAP, a piece-of-poop MS Access database, you name it).
- There isn't a single health insurance carrier (or vision, or dental, or flex, etc.) that covers all the markets where you have offices and facilities. In fact, you can't even limit it to just a few; you actually need to work with dozens of different carriers in pretty much every market in the U.S. All your carriers need different elements of the data, and each has its own proprietary format for data interchange.
- While you're at it, you have to track national trends and expenditures, and manage disbursement of payments to carriers.
So you've already got a significant number of different sorts of databases chunking out HR data in wildly different export formats using completely different data definitions, and you have to get just the right parts of this data in just the right format to all the right carriers at the right time. Nightmare, right?
That's where the company I worked for came in. We checked out their systems and their data exports, learned their business needs and data definitions, then designed and built automated custom import processes to get their data from all their various HR systems into OUR database, using OUR data model.
We also got to know their vendors, the vendors' business needs, and the vendors' systems. Then we designed and built custom exports that got JUST the data each vendor needed in EXACTLY the format that worked for that vendor's systems. When this was done well, the results were a wonder to behold: automated, centralized, electronic data exchange with impressive accuracy.
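For the data-monkeys in the audience, here is a minimal sketch of that import-then-export pattern. Everything in it is made up for illustration: the field names, the two "HR systems," and the pipe-delimited carrier format are assumptions, not the actual systems or formats we worked with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    """The shared data model every import is translated into (hypothetical fields)."""
    employee_id: str
    last_name: str
    state: str
    plan: str


def import_system_a(row: dict) -> Employee:
    # Hypothetical export from one HR system: upper-case column names.
    return Employee(
        employee_id=str(row["EMPLID"]),
        last_name=row["LAST_NAME"].strip().title(),
        state=row["STATE"].upper(),
        plan=row["MED_PLAN"],
    )


def import_system_b(row: dict) -> Employee:
    # Hypothetical export from another system: name stored as "last, first".
    last, _first = [part.strip() for part in row["name"].split(",", 1)]
    return Employee(
        employee_id=str(row["id"]),
        last_name=last.title(),
        state=row["st"].upper(),
        plan=row["coverage"],
    )


def export_for_carrier(employees, states):
    # Send a carrier only the people in its markets, in its (made-up) format.
    return [
        f"{emp.employee_id}|{emp.last_name}|{emp.plan}"
        for emp in employees
        if emp.state in states
    ]


employees = [
    import_system_a({"EMPLID": 1001, "LAST_NAME": "SMITH ", "STATE": "tx", "MED_PLAN": "PPO"}),
    import_system_b({"id": "B-7", "name": "jones, pat", "st": "NY", "coverage": "HMO"}),
]
carrier_feed = export_for_carrier(employees, {"TX"})
```

The point of the design is that each source system and each carrier needs only one translator to or from the shared model, instead of one translator for every source-and-carrier pair.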
Near the end of my time working for this data management company, certain HIPAA rules came into effect that mandated that a number of our feeds (in and/or out) use a particular standard for EDI called the ASC X12N 834 (004010X095). We affectionately called it "the eight-thirty-four."
(It was while working for this data management company that I learned to think of myself as a data-monkey, and started to learn SQL from some really great programmers.)
In a number of ways, the advent and growth of this EDI standard made the job easier and faster, and it was hoped at the time I left that industry that this increasing move towards standardization would make data interchange easier and less expensive.
What does this have to do with federated search? Everything. What's to stop a good, well-designed federated search tool from importing indices from multiple databases (via custom import processes designed and built specifically with the data definition translation needs in mind), and searching the combined indices by the data definitions they now share? Can we not hope that standards of data definition will be encouraged (or demanded) and eventually implemented to one degree or another by database vendors?
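To make that idea concrete, here is a rough sketch of a combined index. It is not how any real metasearch product works; the database names, record ids, and the subject-heading translation tables are all hypothetical, and a real tool would have far messier mappings.

```python
from collections import defaultdict

# Hypothetical per-database tables translating local subject headings
# into one shared vocabulary.
TRANSLATIONS = {
    "db_one": {"Motion pictures": "film"},
    "db_two": {"Cinema": "film", "Movies": "film"},
}


def build_combined_index(sources):
    """Import each source's records, translating subjects on the way in."""
    index = defaultdict(set)
    for source_name, records in sources.items():
        mapping = TRANSLATIONS.get(source_name, {})
        for record_id, local_subjects in records:
            for subject in local_subjects:
                # Fall back to the local heading when no translation exists.
                shared = mapping.get(subject, subject).lower()
                index[shared].add((source_name, record_id))
    return index


def search(index, term):
    """Search the combined index by the shared vocabulary."""
    return sorted(index.get(term.lower(), set()))


sources = {
    "db_one": [("a1", ["Motion pictures"])],
    "db_two": [("b9", ["Cinema"]), ("b10", ["Theater"])],
}
index = build_combined_index(sources)
hits = search(index, "Film")
```

Here a single search for "film" finds records that the two databases filed under different headings, which is exactly the miss Woody is worried about when no translation step exists.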
First moral: I'm not saying it is easy. I'm saying it is do-able. I'm nowhere near ready to give up on this relatively young idea.
Woody also wrote:
Federated searching in most cases turns out to be sorta like searching Picasa, Flickr, CC, and Wikimedia Commons all at once for a folksonomic tag term and calling the results thorough.
This is a completely fair point. After all, the data I helped manage at my former job WAS all the same kind of data: HR data. This leads us to the second moral: Some databases shouldn't and can't be effectively searched in a federated model because their data types are not similar enough or because they have virtually no structure.
Third moral: The result set from any one query cannot be called thorough, even if it is performed in a single database. If I searched for "library" at Flickr, that wouldn't get me all possible library-related images. Human judgement would also demand that I search for "libraries", "librarian", and a number of other related words and phrases.
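A tiny sketch of the difference, assuming a hand-built list of related terms and some made-up photo tags (this is not Flickr's API, just plain Python over sample data):

```python
# Hypothetical hand-maintained list of related search terms.
RELATED_TERMS = {"library": ["library", "libraries", "librarian"]}


def expand(term):
    """Return the term plus any related terms we know about."""
    return RELATED_TERMS.get(term.lower(), [term.lower()])


def search_tags(photos, term):
    """Match a photo if any of its tags is one of the expanded terms."""
    wanted = set(expand(term))
    return [
        photo_id
        for photo_id, tags in photos
        if wanted & {tag.lower() for tag in tags}
    ]


photos = [
    ("p1", ["library"]),
    ("p2", ["Libraries", "books"]),
    ("p3", ["librarian"]),
    ("p4", ["beach"]),
]
exact_only = [photo_id for photo_id, tags in photos if "library" in tags]
expanded = search_tags(photos, "library")
```

The exact-match search finds one photo; the expanded search finds three. Neither can promise it found everything, which is the point.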
However, using Flickr as an example seems like a bit of an unfair analogy. Sure, some databases don't define their data well enough, but none are as chaotic as the folksonomies online. (That'll be the topic of a future post.)
Thank you, Woody, for the discussion. I HAD written the above as a comment on your blog, but I thought it would be obnoxious to leave a comment of this size.
Please feel free, of course, to tell me I'm all wrong on all of this. I'm wrong a lot, after all.
And, folks: Check out ISHUSH. Woody says stuff that makes me go "hmmmm."