Archive for the ‘Technology’ Category

Tobias Larsson Hult

Real time search in the Enterprise

maj 10 - 2010 | Tobias Larsson Hult

Real time search is a big fuzz in the global network called Internet. Major search engines like Google and Bing are now providing users with real time search results from Facebook, Twitter, Blogs and other social media sites. Real time search means that as soon as content are created or updated, it is immediately searchable. This might be obvious and seems like a basic requirement, but working with search you know that this is not the case most of the time. Looking inside the firewall, in the enterprise, I dare to say that real time search is far from common. Sometimes content is not changed very frequently so it is not necessary to make it instantly searchable. Though, in many cases it’s the technical architecture that limits a real time search implementation.

The most common way of indexing content is by using a web crawler or a connector. Either way, you schedule them to go out and fetch new/updated/deleted content at specific interval during the day. This is the basic architecture for search platforms these days. The advantage of this approach is that the content systems does not need to adapt to the search platform, they just deliver content through their ordinary API:s during indexing. The drawback is that new or updated content is not available until next scheduled indexing. Depending on the system this might take several hours. Due to several reasons, mostly performance, you do not want to schedule connectors or web crawlers to fetch content too often. Instead, to provide real time search you have to do the other way around; let the content system push content to the search platform.

Most systems have some sort of event system that triggers an event when content is created/updated/deleted. Listening for these events, the system can send the content to the search platform at the same time it’s stored in the content system. The search platform can immediately index the pushed content and make it searchable. This requires adaptation of the content system towards the search platform. In this case though, I think the advantages outweighs the disadvantages. Modern content systems of today are (or should be) providing a plug-in architecture so you should fairly easy be able to plug in this kind of code. These plugins could also be provided by the search platform vendors just as ordinary connectors are provided today.

Do you agree, or have I been living in a cave for the past years? I’d love to hear you comments on this subject!

Max Charas

Solr Processing Pipeline

april 19 - 2010 | Max Charas

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I´ve been looking into the Apache Commons Processing Pipeline. It seems like a likely candidate to do some cool stuff.  Look at the diagram below.

A schematic drawing of a Solr Pipeline concept. (Click to enlarge)

What I´m thinking of is to make a transparent Solr pipeline that speaks the Solr REST protocol on each end. This means that you would be able to use SolrJ or any other API to communicate with the Pipeline.

Has anyone attempted this before?  If you’re interested in chatting about the pipeline drop me a mail or just grab me at Eurocon in Prague this year.

Maria Johansson

Faceted Search by LinkedIn

mars 12 - 2010 | Maria Johansson

My RSS feeds have been buzzing about the LinkedIn faceted search since it was first released from beta in December. So why is the new search at LinkedIn so interesting that people are almost constantly discussing it? I think it’s partly because LinkedIn is a site that is used by most professionals and searching for people is core functionality on LinkedIn. But the search interface on LinkedIn is also a very good example of faceted search.

I decided to have a closer look into their search. The first thing I realized was just how many different kinds of searches there are on LinkedIn. Not only the obvious people search but also, job, news, forum, group, company, address book, answers and reference search. LinkedIn has managed to integrate search so that it’s the natural way of finding information on the site. People search is the most prominent search functionality but not the only one.

I’ve seen several different people search implementations and they often have a tendency to work more or less like phone books. If you know the name you type it and get the number. And if you’re lucky you can also get the name if you only have the number. There is seldom anyway to search for people with a certain competence or from a geographic area. LinkedIn sets a good example of how searching for people could and should work.

LinkedIn has taken careful consideration of their users; What information they are looking for, how they want it presented and how they need to filter searches in order to find the right people. The details that I personally like are the possibility to search within filters for matching options (I worked on a similar solution last year) and how different filters are displayed (or at least in different order) depending on what query the user types. If you want to know more about how the faceted search at LinkedIn was designed, check out the blog post by Sara Alpern.

But LinkedIn is not only interesting because of the good search experience. It’s also interesting from a technical perspective. The LinkedIn search is built on open source so they have developed everything themselves. For those of you interested in the technology behind the new LinkedIn search I recommend “LinkedIn search a look beneath the hood”, by Daniel Tunkelang where he links to a presentation by John Wang search architect at LinkedIn.

Andreas Franzon

Search Driven Portals – Personalizing Search

mars 2 - 2010 | Andreas Franzon

To stay in the front edge within search technology, Findwise has a focus on research, both in the form of larger research projects and with different thesis projects. Mohammad Shadab and I just finished our thesis work at Findwise, where we have explored an idea of search user interfaces which we call search driven portals. User interfaces are mostly based on analysis of a smaller audience but the final interface is then put in production which targets a much wider range of users. The solution is in many cases static and cannot easily be changed or adapted. With Search driven portals, which is a portlet based UI, the users or administrators can adapt the interface specially designed to fulfill the need for different groups. Developers design and develop several searchlets (portlets powered by search technology), where every searchlet provides a specific functionality such as faceted search, results list, related information etc. Users can then choose to add the searchlets with functionality that suits them into their page on a preferred location. From architectural perspective, searchlets are standalone components independent from each other and are also easy to reuse.

Such functionality includes faceted search which serves as filters to narrow a search. These facets might need to be different based on what kind of role, department or background users have. Developers can create a set of facets and let the users choose the ones that satisfy their needs. Search driven portals is a great tool to make sure that sites don’t get flooded with information as new functionalities are developed. If a new need evolves, or if the provider comes with new ideas, the functionality is put into new searchlets which are deployed into the searchlet library. The administrator can broadcast new functionality to users by putting new searchlets on the master page, which affects every user’s own site. However, the users can still adjust new changes by removing the new functionality provided.

Search driven portals opens new ways of working, both in developer and usage perspective. It is one step away from the one size fits all concept, which many sites is supposed to fulfill. Providers such as Findwise can build a large component library which can be customized into packages for different customers. With help of the searchlet library, web administrators can set up designs for different groups, project managers can set up a project adjusted layout and employees can adjust their site after their own requirements. With search-driven portals, a wider range of users needs can more easily be covered.

Tobias Larsson Hult

To Crawl or not to Crawl

december 11 - 2009 | Tobias Larsson Hult

Having an Enterprise Search Engine, there are basically two ways of getting content into the index; using a web crawler or a connector. Both methods have their advantages and disadvantages. In this post I’ll try to poinpoint the differences with the two methods.

Web crawler

Most systems of today have a web-interface. Let it be your time reporting system, intranet, document management, you’ll probably access those with your web browser. Because of this, it’s very easy to use a web crawler to index this content as well.

The web crawler index the pages by starting at one page. From there, it follows all outbound links and index those. From those pages, it follows all links, and so on. This process continues until all links at a web site has been followed and the pages been indexed. The crawler thus uses the same technique as a human, visit a page and clicking the links.

Most Enterprise Search Engines are bundled with a web crawler. Thus, it’s usually very easy to get started. Just enter a start page and within minutes you’ll have searchable content in your index. No extra installation or license fee are required. For some sources, this may also be the only option, i.e if you’re indexing external sources that your company has no control of.

The main disadvantage though, is that web pages are designed for humans, not crawlers. This means that there are a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers and footers and so on. All of this makes it a more pleasant experience for the user, and also making it easier to navigate on the page. The crawler on the other hand has no use of this information when retrieving pages. It’s actually reducing information quality in the index. For example, a navigation menu will be displayed on every page, thus the crawler will index the navigation content for all pages. So if you have a navigation item called ”Customers” and a user searches for customers, he/she will get a hit in ALL pages in the index.

There are ways to get around this, but it requires either altering of the produced HTML or adjustments in the search engine. Also, if the design of the site change, you have to do these adjustments again.

Connector

Even though the majority of systems has a web-interface, the content is stored in a data source of some format. It might be a database, structured file system, etc. By using a connector, you connect either to the underlying data source or to the system directly by its programming API.

Using a connector, the search engine does not get any presentation information but only the pure content, making the information quality in the index better. The connector can also retrieve all metadata associated with the information which further increases the quality. Often, you’ll also have more fine-grained control over what will be indexed with a connector than a web crawler.

Though, using a connector requires more configuration. It might also cost some extra money to buy one for your system, and require additional hardware. Though, once set up, it’s most likely to produce more relevant results compared to a web crawler.

Bottom line is it’s a consideration between quality and cost, as most decisions in life :)