Archive for December, 2009

Tobias Berg

To Crawl or not to Crawl

December 11 - 2009 | Tobias Berg

With an Enterprise Search Engine, there are basically two ways of getting content into the index: using a web crawler or using a connector. Both methods have their advantages and disadvantages. In this post I’ll try to pinpoint the differences between the two.

Web crawler

Most systems today have a web interface. Whether it is your time reporting system, intranet or document management system, you’ll probably access it with your web browser. Because of this, it’s very easy to use a web crawler to index the content as well.

The web crawler indexes pages by starting at one page. From there, it follows all outbound links and indexes the pages they lead to. From those pages, it again follows all links, and so on. This process continues until all links on the web site have been followed and all pages have been indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
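The link-following process described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular search engine's crawler is built: the `fetch` callable, the `SITE` dictionary and the `intranet.example` URLs are all made up for the example, and a real crawler would make HTTP requests and pass each page on to the indexer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns the URLs in the order they were indexed.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    indexed = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        indexed.append(url)  # a real crawler would send the page to the indexer here
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            # Stay on the same site and never queue a page twice.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# A tiny in-memory "site" standing in for real HTTP requests (illustrative data).
SITE = {
    "http://intranet.example/": '<a href="/hr">HR</a> <a href="/docs">Docs</a>',
    "http://intranet.example/hr": '<a href="/">Home</a> <a href="/docs">Docs</a>',
    "http://intranet.example/docs": '<a href="http://other.example/">External</a>',
}

print(crawl("http://intranet.example/", SITE.get))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, which they almost always do.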

Most Enterprise Search Engines are bundled with a web crawler. Thus, it’s usually very easy to get started. Just enter a start page and within minutes you’ll have searchable content in your index. No extra installation or license fee is required. For some sources, this may also be the only option, e.g. if you’re indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means that there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers and footers. All of this makes for a more pleasant experience for the user and makes the page easier to navigate. The crawler, on the other hand, has no use for this information when retrieving pages. It actually reduces the information quality in the index. For example, a navigation menu is displayed on every page, so the crawler will index the navigation content for every page. So if you have a navigation item called “Customers” and a user searches for customers, he/she will get a hit on ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to make these adjustments again.
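As a rough illustration of such an adjustment, the sketch below drops the text inside presentation elements before a page is indexed. It assumes the site marks those regions with `nav`, `header` and `footer` tags, which is an assumption made for this example; many sites use their own markup, and that is exactly why a redesign forces the rules to be redone.

```python
from html.parser import HTMLParser

# Elements assumed to hold presentation-only content; adjust per site.
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_content(html):
    """Returns the page text with navigation, headers and footers removed."""
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


page = """
<html><body>
  <nav><a href="/customers">Customers</a> <a href="/products">Products</a></nav>
  <h1>Time reporting</h1>
  <p>Report your hours before Friday.</p>
  <footer>Contact the intranet team</footer>
</body></html>
"""

print(extract_content(page))
```

With this filter in place, a search for “Customers” no longer matches the page just because the word appears in its menu.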

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some kind. It might be a database, a structured file system, etc. With a connector, you connect either to the underlying data source or directly to the system through its API.

Using a connector, the search engine does not get any presentation information but only the pure content, which makes the information quality in the index better. The connector can also retrieve all metadata associated with the information, which further increases the quality. Often, you’ll also have more fine-grained control over what will be indexed with a connector than with a web crawler.
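To make the idea concrete, here is a minimal sketch of a connector that reads straight from a database. The `documents` table, its columns and the `published` flag are invented for the example, and an in-memory SQLite database stands in for the real source system; commercial connectors are considerably more involved.

```python
import sqlite3


def fetch_documents(conn):
    """Reads content and metadata straight from the data source.

    Only published rows are returned, to illustrate the finer control
    a connector has over what gets indexed.
    """
    rows = conn.execute(
        "SELECT id, title, body, author, modified "
        "FROM documents WHERE published = 1 ORDER BY id"
    )
    return [
        {"id": doc_id, "title": title, "body": body,
         "author": author, "modified": modified}
        for doc_id, title, body, author, modified in rows
    ]


# An in-memory database standing in for the source system (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT,"
    " author TEXT, modified TEXT, published INTEGER)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Travel policy", "Book trips through the portal.", "HR", "2009-11-02", 1),
        (2, "Draft budget", "Not final.", "Finance", "2009-12-01", 0),
    ],
)

for doc in fetch_documents(conn):
    print(doc)
```

Each document arrives with clean fields such as author and modification date and with no menus or footers attached, so the search engine can use the metadata for filtering and ranking.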

However, using a connector requires more configuration. It might also cost extra money to buy one for your system, and it may require additional hardware. Once set up, though, it is likely to produce more relevant results than a web crawler.

The bottom line is that it’s a trade-off between quality and cost, like most decisions in life :)

Caroline Abrahamsson

Do you know something I don’t? The art of benchmarking

December 1 - 2009 | Caroline Abrahamsson

During the autumn we have been trying to keep our customers and others up to date with the search world by hosting breakfast seminars.
By sharing experiences and discussing with others, the participants have taken giant leaps in understanding the true value that search can deliver.
The same goes for sharing experiences between companies, where you often find yourself struggling with the same problems, regardless of business or company size.

We have been discussing how Enterprise search can help intranets, extranets, external sites and support centers to capitalize on their knowledge.
Some of the things that have been discussed:

…Business Cases:
How can search help companies save 100 million SEK/year?
How do you count return on investment (ROI) for search?

…Search functionality:
How and why should you work with:
Key Matches to promote certain content (similar to Google’s sponsored links on the web)
Synonyms (to make sure that the end users’ language corresponds to the corporate language without having to change the information)
Query completion and suggestion, to give users an overview of what other people have been searching for as they start to type (similar to Apple’s web site search).

…End-user experience
How can different interfaces serve different information needs and user-groups?
How does your user interface serve your end-users?

…Information Quality
Do taxonomies and folksonomies help us find information faster?
Can search be used to improve the quality of your content?

During the spring we will continue to hold seminars, keeping you up to date. If you’re not on our mailing list, please send us an e-mail and we’ll make sure you get an invitation.

On Wednesday and Thursday this week we will be attending the Ability conference to discuss search. Hope to see you there!