Archive for the ‘Lucene’ Category

Kristian Norling

Text Analytics in Enterprise Search

January 11 - 2012 | Kristian Norling

A presentation made by Daniel Ling at Apache Lucene Eurocon in Barcelona, october 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future. By doing presentations, sponsoring, contributing code. Hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.

Anders Rask

Apache Nutch making use of Open Pipeline

November 11 - 2010 | Anders Rask

During the last couple of months I’ve been working on a project for Uppsala University. The project’s goal is to improve the findability on the university web site. The solution that we are working on is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well and also gives us a page rank for each page that we can use for relevance tuning. Besides the web information crawled by Nutch, the search application will also be used to search people and organizational information that we index from another source. I thought that I would share some details on how we are using Nutch.

We have made two extensions to Nutch, one is a parser plug-in that can run Open Pipeline embedded in it. This was an important extension in order to get better control of the information that we index to Solr and also to be able to reuse our different Open Pipeline components. The main stages of the pipeline are the following:

  1. Extract the encoding of a web page
  2. Extract all links from a web page
  3. Extract all headings (hx) from a web page
  4. Remove all tags that don’t contain complete sentences on a web page
  5. Extract text and metadata from different types of documents with Tika
  6. Do some metadata mapping and cleaning
  7. Populate facets according to metadata and/or URL
  8. Do static URL ranking
  9. Replace certain common titles with the largest heading of the web page

The other extension we made to Nutch is an indexing filter that makes sure all our metadata fields are indexed to Solr.

So far so good. The fetching, parsing and indexing works well now and currently our largest challenge is tuning all the different relevance parameters we have, as well as harmonizing the relevance of web information to that of people and organizational information. I will have to get back to you on how that went!

Max Charas

Structure First or Structure Last?

October 17 - 2010 | Max Charas

I’d like to share two different development techniques I commonly use when setting up a Apache Solr project. To explain it I’ll start by introducing the way I used to work. (The wrong way ;) )

The Structure First Technique

Since I work as a search consultant I come across a lot of different data sources.  All of these data sources have at least some structure, some more than others.

My objective as a backend developer was then to first of all figure out how the data source was structured and then design a Solr schema that fit the requirements, both technical and business.

The problem with this was of course that the requirements were quite fuzzy until I actually figured out how the data was structured and even more importantly what the data quality was.

In many cases I would spend a lot of time on extracting a date from the source, converting that to an ISO 8601 date format (Supported by Solr), updating the schema with that field and then finally reindexing. Only to learn that the date was either not required or had too poor data quality to be used.

My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.

The Structure Last Technique

Ok so what’s the supposed “right way” of doing this?

In Solr there is a concept called dynamic fields. It allows you to map fields that fulfil a certain name criteria to a specific type. In the example Solr schema you can find the following section:

<!– uncomment the following to ignore any fields that don’t already match an existing

field name or dynamic field, rather than reporting them as an error.

alternately, change the type=”ignored” to some other type e.g. “text” if you want

unknown fields indexed and/or stored by default –>

<!–dynamicField type=”ignored” multiValued=”true” /–>

The section above will drop any fields that are not explicitly declared in the schema. But what I usually do to start with is to do the complete opposite. I map all fields to a string type.

<dynamicField multiValued=”true” indexed=”true” stored=”true”/>

I start with a minimalist schema that only has an id field and the above stated dynamic field.

With this schema it doesn’t matter what I do, everything is mapped to a string field, exactly as it is entered.

This allows me to focus on getting the data into Solr without caring about what to name the fields, what properties they should have and most importantly to even having to declare them at all.

Instead I can focus on getting the data out of the source system and then into Solr. When that’s done I can use Solr´s schema browser to see what fields are high quality, contain a lot of text or are suited to be used as facets and use this information to help out in the requirements process.

The Structure Last Technique lets you be more pragmatic about your requirements.

Maria Johansson

Faceted Search by LinkedIn

March 12 - 2010 | Maria Johansson

My RSS feeds have been buzzing about the LinkedIn faceted search since it was first released from beta in December. So why is the new search at LinkedIn so interesting that people are almost constantly discussing it? I think it’s partly because LinkedIn is a site that is used by most professionals and searching for people is core functionality on LinkedIn. But the search interface on LinkedIn is also a very good example of faceted search.

I decided to have a closer look into their search. The first thing I realized was just how many different kinds of searches there are on LinkedIn. Not only the obvious people search but also, job, news, forum, group, company, address book, answers and reference search. LinkedIn has managed to integrate search so that it’s the natural way of finding information on the site. People search is the most prominent search functionality but not the only one.

I’ve seen several different people search implementations and they often have a tendency to work more or less like phone books. If you know the name you type it and get the number. And if you’re lucky you can also get the name if you only have the number. There is seldom anyway to search for people with a certain competence or from a geographic area. LinkedIn sets a good example of how searching for people could and should work.

LinkedIn has taken careful consideration of their users; What information they are looking for, how they want it presented and how they need to filter searches in order to find the right people. The details that I personally like are the possibility to search within filters for matching options (I worked on a similar solution last year) and how different filters are displayed (or at least in different order) depending on what query the user types. If you want to know more about how the faceted search at LinkedIn was designed, check out the blog post by Sara Alpern.

But LinkedIn is not only interesting because of the good search experience. It’s also interesting from a technical perspective. The LinkedIn search is built on open source so they have developed everything themselves. For those of you interested in the technology behind the new LinkedIn search I recommend “LinkedIn search a look beneath the hood”, by Daniel Tunkelang where he links to a presentation by John Wang search architect at LinkedIn.

Caroline Abrahamsson

How to create better search – VGR leads the way

January 11 - 2010 | Caroline Abrahamsson

I realise we are a bit late. Fredrik Wackå, a senior IT-strategist, has already written an excellent article on his blog (in Swedish). He has, among other things, been interviewing Kristian Norling (at Twitter), who has been working with portal strategies and search for many years at Västra Götalands regionen.
Although, for all our non-Swedish speaking guests here is a short summary:

Findwise has during the last few months been working on a new search solution for Västra Götalands regionen.  The two main goals have been to deliver a search experience that seems both fast and accurate.
The result?
Today making a search at VGR takes about 0,1-0,2 seconds, faster than a Google search on the web.

Furthermore, there was a need for context. Large amount of information requires ways to filter and sort – otherwise the users will drown in the result list.
By giving the end-users the ability to sort the search result the users can look for general information within an area as well as quickly narrow down to a specific piece (for example by two clicks be able to see only the PDF-files created in 2009). The filters (and thereby metadata standard) includes:

• Information type
• Where the document resides
• Where it belongs in the organization
• What source it has
• When it was last changed
• Who has written it
• What format it resides in
• Keywords that has been created

VGR

VGR

The search solution also includes a metadata service. As so many others VGR has been struggling with getting the metadata in place.
Apart from the metadata supported by the system (where Dublin Core is being used) the metadata service is doing two things:
• Analyses the content in the text, compares it to taxonomy and gives the writer suggestions of keywords that he/she can use
• Gives the writer the ability to add additional keywords

Apart from this the end-users will be able to add etiquettes (tags). These will be compared with two lists. If the tags appears in the “white list” it will be published right away, if they are in the “blacklist” they will be deleted. Anything inbetween are controlled before they are published.

To conclude: a lot of effort has been put into creating a good search experience and VGR continues to deliver functionality and solutions that are light-years ahead of many others. The combination of supporting systems and using the “collected intelligence” of the writers and end-users will make it even better over time.
Search is about both supporting systems, content and people.

Read more in Fredrik Wackås blog