How to Index and Search XML Content in Solr

January 25 - 2012 | Xiaodong Shen

Indexing XML Content

Solr provides an XML update request handler that accepts documents formatted as XML.

For example:

<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>

However, when a field itself should contain XML-formatted data, the XML update handler fails to import it. The handler parses the import data with an XML parser and tries to read the text that is a direct child of the ‘field’ node, and that text is empty if the field’s direct child is an XML tag.
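The effect can be reproduced outside Solr. The following is a minimal sketch that uses the JDK's DOM parser as a stand-in (it is not Solr's actual parsing code, and the class and method names are made up for illustration): it collects only the text nodes that are direct children of a field element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextDemo {
    // Returns only the text that is a direct child of the root element.
    // Text nested inside child elements is deliberately ignored.
    static String directText(String fieldXml) {
        try {
            Element field = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fieldXml)))
                    .getDocumentElement();
            StringBuilder text = new StringBuilder();
            NodeList children = field.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    text.append(child.getNodeValue());
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + fieldXml, e);
        }
    }

    public static void main(String[] args) {
        // Plain text value: the direct child text is the value itself.
        System.out.println("[" + directText("<field name=\"office\">Bridgewater</field>") + "]");
        // XML value: the direct child is an element, so no direct text is found.
        System.out.println("[" + directText("<field name=\"title\"><root p=\"cc\">test</root></field>") + "]");
    }
}
```

The second call yields an empty string, which matches the behaviour described above: the nested markup is seen as structure, not as the field's value.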

A workaround is to use the JSON update handler instead. For example:

[
  {
    "id" : "MyTestDocument",
    "title" : "<root p=\"cc\">test \\ node</root>"
  }
]

There are two things to notice:

  1. Both the " and \ characters must be escaped (as \" and \\)
  2. The XML content must be kept on a single line
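Both rules can be applied programmatically before the JSON file is written. This is a minimal sketch under stated assumptions: the class and helper names are hypothetical, line breaks are replaced with a space (one possible way to keep the XML on a single line), and other JSON control characters are not handled.

```java
class XmlJsonEscaper {
    // Escapes backslashes and double quotes and flattens line breaks.
    // Other JSON control characters (tabs etc.) are not handled in this sketch.
    static String escapeForJson(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\r':
                case '\n': out.append(' '); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Hypothetical helper: wraps an id and an XML string into a
    // one-document JSON array with the fields "id" and "title".
    static String toJsonDocument(String id, String xml) {
        return "[\n  {\n    \"id\" : \"" + escapeForJson(id) + "\",\n"
             + "    \"title\" : \"" + escapeForJson(xml) + "\"\n  }\n]";
    }

    public static void main(String[] args) {
        String xml = "<root p=\"cc\">test \\ node</root>";
        System.out.println(toJsonDocument("MyTestDocument", xml));
    }
}
```

For real data a JSON library would be a safer choice than hand-written escaping; the sketch only makes the two rules concrete.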

The JSON import data can be loaded into Solr with curl:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

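Or with SolrJ:

import java.io.File;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

// serverpath is the Solr base URL; file is the java.io.File holding the JSON data
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
// Post the file to the JSON update handler
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file);
NamedList<Object> result = server.request(csureq);
// A status of 0 in the response header indicates success
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
Integer status = (Integer) responseHeader.get("status");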

Stripping Out XML Tags in the Schema Definition

When querying XML content, we are usually not interested in the XML tags themselves, so the tags need to be stripped out before the XML text is indexed. We can do that by applying HTMLStripCharFilter to the XML content.

    <analyzer type="index">
        ...
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        ...
    </analyzer>
    <analyzer type="query">
        ...
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        ...
    </analyzer>

Searching XML Content

Searching XML content does not differ much from searching plain text content. However, searching for XML attributes requires a special tweak.

The HTMLStripCharFilter mentioned earlier filters out all XML tags, including their attributes. In order to index attributes, we need to find a way to make the attribute text survive HTMLStripCharFilter.

For example, given the following original XML content:

<sample attr="key_o2_4">find it</sample>

after applying HTMLStripCharFilter we want to have:

key_o2_4    find it

One way to achieve this is to add helper XML processing instructions to the original XML content, such as:

<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>

Then apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition:

<analyzer type="index">
...
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;\?solr ([A-Za-z0-9_-]*)\?&gt;" replacement="       $1  " maxBlockChars="10000000"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
...
</analyzer>

This replaces <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces, so that the original offsets are preserved.

With this technique, a search on the value of the attr attribute returns a hit.
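The offset-preserving replacement can be tried out with plain java.util.regex. This is a minimal sketch, not Solr code: the pattern is assumed to accept lowercase letters so that it matches key_o2_4, and the tag-stripping regex is only a crude stand-in for HTMLStripCharFilter.

```java
import java.util.regex.Pattern;

class AttributeMarkerDemo {
    // Matches the helper instruction, e.g. <?solr key_o2_4?>, and captures the key.
    static final Pattern MARKER = Pattern.compile("<\\?solr ([A-Za-z0-9_-]*)\\?>");

    // Replaces "<?solr " (7 chars) and "?>" (2 chars) with the same number
    // of spaces, so every following character keeps its original offset.
    static String exposeMarkers(String xml) {
        return MARKER.matcher(xml).replaceAll("       $1  ");
    }

    // Crude stand-in for HTMLStripCharFilter: drops anything that looks like a tag.
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String xml = "<sample attr=\"key_o2_4\"><?solr key_o2_4?>find it</sample>";
        String exposed = exposeMarkers(xml);
        System.out.println(exposed.length() == xml.length()); // lengths are compared, not assumed
        System.out.println("[" + stripTags(exposed) + "]");
    }
}
```

The order matters: the marker replacement has to run before the tags are stripped, because a tag stripper would remove the processing instruction together with the other markup.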

Search-driven Navigation and Content

January 19 - 2012 | Maggie Michnik

At the beginning of October I attended the Microsoft SharePoint Conference 2011 in Anaheim, USA. Many interesting and useful topics were discussed. One really interesting session was Content Targeting with the FAST Search Web Part by Martin Harwar.

Martin Harwar talked about how search can be used to show content on a web page. The most common search-driven content is of course the traditional search result list, but much more content can be retrieved by search. One example is search-driven navigation and content. Search-driven navigation means that instead of having static links on a page, we can render them depending on the query the user typed in. If, for example, a user on a health care site has recently searched for “ear infection”, the page can show links to ear specialist departments. When the user does another search and returns to the same page, the links will be different.

In the same way we can render content on the page. Imagine the website of a tools business whose start page has two lists of products: the most popular and the newest tools. To adapt these lists to a user, we only want to show products that are of interest to that user. Instead of only showing the most popular and newest tools, the lists can also be filtered on the last query the user typed. Assume a user searches for “saw” and then returns to the page with the product lists. The lists will now show the most popular saws and the newest saws. This can also be used when a user finds the company’s webpage by searching for “saw” on, for instance, Google.
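The idea of filtering a product list on the user's last query can be sketched in a few lines. Everything here is a hypothetical illustration: the catalog, the view counts, and the simple substring match stand in for a real search engine query, and none of it reflects the FAST Search Web Part API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SearchDrivenList {
    static final class Product {
        final String name;
        final int views;
        Product(String name, int views) {
            this.name = name;
            this.views = views;
        }
    }

    // Made-up catalog used only for this illustration.
    static List<Product> sampleCatalog() {
        return Arrays.asList(
                new Product("Hammer", 900),
                new Product("Hand saw", 300),
                new Product("Circular saw", 450),
                new Product("Drill", 700));
    }

    // Returns the most viewed products, restricted to those matching the
    // user's last query. An empty or missing query leaves the list unfiltered.
    static List<String> mostPopular(List<Product> catalog, String lastQuery, int limit) {
        String needle = lastQuery == null ? "" : lastQuery.trim().toLowerCase(Locale.ROOT);
        List<Product> matches = new ArrayList<>();
        for (Product p : catalog) {
            if (needle.isEmpty() || p.name.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(p);
            }
        }
        matches.sort(Comparator.comparingInt((Product p) -> p.views).reversed());
        List<String> names = new ArrayList<>();
        for (Product p : matches.subList(0, Math.min(limit, matches.size()))) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(mostPopular(sampleCatalog(), "", 2));    // no previous search
        System.out.println(mostPopular(sampleCatalog(), "saw", 2)); // after searching for "saw"
    }
}
```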

This shows that search can be used in many ways to personalize a webpage and thereby increase Findability.

Text Analytics in Enterprise Search

January 11 - 2012 | Kristian Norling

A presentation given by Daniel Ling at Apache Lucene Eurocon in Barcelona, October 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future, by giving presentations, sponsoring and contributing code. We hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.

ExternalFileField in Solr

January 4 - 2012 | Maggie Michnik

Sometimes we want to update the values of one indexed field more often than those of other fields. A good solution is to use the field type ExternalFileField. An ExternalFileField gets its values from an external file instead of the index. Such a file can easily be changed, and the new values take effect after a commit, so no documents need to be re-indexed. A field of type ExternalFileField is not searchable; it may currently only be used as a ValueSource in a FunctionQuery.

The external file contains keys and values:

key1=value1
key2=value2

The keys don’t need to be unique.

The name of the external file must be external_<fieldname> or external_<fieldname>.* and must be placed in the index directory.
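Producing such a file is straightforward. The following is a minimal sketch, assuming a hypothetical field name and a temporary directory in place of the real index directory; keys that themselves contain "=" are not handled.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExternalFileWriter {
    // Turns key/value pairs into the key=value lines of the external file.
    static List<String> toLines(Map<String, Float> valuesByKey) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Float> entry : valuesByKey.entrySet()) {
            lines.add(entry.getKey() + "=" + entry.getValue());
        }
        return lines;
    }

    // Writes external_<fieldname> into the given directory and returns its path.
    static Path write(Path indexDir, String fieldName, Map<String, Float> valuesByKey)
            throws IOException {
        Path target = indexDir.resolve("external_" + fieldName);
        return Files.write(target, toLines(valuesByKey), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical data: page keys mapped to visit counts.
        Map<String, Float> visits = new LinkedHashMap<>();
        visits.put("page1", 1520f);
        visits.put("page2", 37.5f);

        // A temporary directory stands in for the index directory here.
        Path dir = Files.createTempDirectory("solr-index");
        Path file = write(dir, "popularity", visits);
        System.out.println(file.getFileName());
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

A job like this can be run whenever the statistics change; the new values are picked up after the next commit, without re-indexing any documents.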

A new field type of class ExternalFileField and a field of that type must be added to schema.xml.

<fieldType name="file" class="solr.ExternalFileField"
           keyField="keyField" defVal="1" indexed="false"
           stored="false" valType="float" />

<field name="<fieldname>" type="file" />

keyField is the field that contains the keys and <fieldname> contains the values from the external file.

valType defines the value type of the field.

At Findwise we have used this method for a customer who wanted the most visited pages to appear higher up in the search results. These statistics change daily for a lot of pages, and we don’t want to re-index all of those pages every day.

Analytics and BigData at IBM Information On Demand 2011

December 20 - 2011 | Leonard Saers

The big trend these days is BigData: analyzing large amounts of information in order to gain important insights and, from those insights, to take the right action. This trend was a hot topic at the IBM Information On Demand (IOD) conference in Las Vegas earlier this year. IBM has a very strong position in this field; it’s hard to have missed how their computer system Watson recently challenged the top Jeopardy players of all time, and won!

Now IBM has taken the technology behind Watson and started to apply it in their different analytics products, where one specific area that is being targeted is healthcare. For this area IBM released a new product during IOD called IBM Content and Predictive Analytics for Healthcare, which can for example be used as a tool for physicians to support them in their diagnosis of patients.

In April this year IBM merged two of their products: their search engine OmniFind and Content Analytics, their product for analyzing large amounts of unstructured information. The new product is called IBM Content Analytics with Enterprise Search, and it too is based on much of the same technology that is used in Watson; more specifically, it uses the same Natural Language Processing techniques. This means that it has the ability to understand text on a level just as sophisticated as that of Watson.

Content Analytics with Enterprise Search scales very well to many millions of documents. However, for analyzing really enormous data sets, on the order of petabytes or even exabytes, IBM has developed what they call their BigData platform. This platform mainly revolves around two products, InfoSphere Streams and InfoSphere BigInsights, and it builds on a foundation of open source software such as Apache Hadoop and Apache Lucene. InfoSphere Streams is used for real-time analysis of information in motion. This helps you understand what’s happening right at this moment in your organization and supports you in taking appropriate action as things are happening. InfoSphere BigInsights, on the other hand, lets you analyze and draw insight from massive amounts of already existing data.

Studies have shown how organizations that fall short in this area are overtaken by those who understand how to use the power of analytics.

IBM has surely chosen an interesting path when merging Analytics with Findability.

Inspiration from the Enterprise Search Europe conference

November 11 - 2011 | Christian Ubbesen

A couple of weeks ago, some of my colleagues and I attended the Enterprise Search Europe conference in London. We’re very grateful to the organizer Martin White at IntranetFocus for arranging the event, and for having us as one of the gold sponsors.

For me it was the first time in years I attended a conference like this, and while it was “same old, same old” for many of the attendees, for me it was enlightening to meet up with the industry and have a discussion on where we are as an industry.

The attendees were mainly software vendors and professional services/consultants, as well as a few customers or actual users of enterprise search. I think the consensus of the two days was that we in the industry STILL haven’t really figured out what we should do with the enterprise search concept, and how to make it valuable for our customers. We at Findwise are not alone with this challenge; it is an industry challenge. Some vendors seem to be doing good work delivering real value to customers, and a few of our colleagues in the industry do good professional services/consulting work. At first it was a bit of a downer to realize that we haven’t progressed more during the 10 years I’ve been in the business, but at the same time it was very inspiring to see that we at Findwise, together with a few other players, seem to be on the right track with our hard work, and that we are in a position to solve some of the real industry challenges we’re facing.

As I see it, if we gather our forces and make a focused “push forward” together now, we will be able to take the industry to a new maturity level where we better solve real business challenges with enterprise search (or search-driven Findability solutions, as we like to call them).

My simple analysis of all the discussions at the conference is that we need to do two things:

  1. Manage the whole “full picture” of enterprise search – from strategy to organizational governance, involving necessary competencies to cover all aspects of a successful Findability solution.
  2. Break down the customer challenge into manageable chunks, and solve actual business problems, not just solving the traditional “finding stuff when needed” challenge.

I think we are on the right track, and it’s going to be a very interesting journey from here on!

Content choreography?

October 27 - 2011 | Christopher Wallstrom

Is getting the right content to the right users and customers a priority for you and your organisation? Do you drown in too much information? With some insight into how to manage content your answer is probably “Yes!”.

Today we have loads of channels to choose from, e-mails, internet/intranets, Yammer feeds, blogs and different collaboration platforms and social media services. Some content is more beneficial in one channel and other content in another channel. But how do you make sure the right information reaches the right users, in the right channels?

Content Choreography aims to handle all of that: content, strategy, format and delivery.

We need to tailor the user/customer experience in order to achieve good Findability. How? Taxonomy, Metadata and Search!
Taxonomy to ensure that we speak the same language, metadata to classify the content to fulfill a certain task or objective and search to deliver it to the right channel.

Need more information about Content Choreography?
Join us at our joint seminar with KnowIT on Nov 22nd, Future Choreography of Content Management, where Seth Earley, CEO at Earley Associates, will speak about Content Choreography – The Art of Dynamic Web Content. Seth Earley has more than 20 years of experience in the field and is a very eloquent and interesting speaker. He will share thoughts and ideas gathered from a number of large customers worldwide.

More information and registration can be found here.

Contributor vs. Consumer

October 25 - 2011 | Ludvig Johansson

A couple of weeks ago I had the opportunity to attend the Microsoft SharePoint Conference 2011 in Anaheim, USA. This turned out to be an intense four-day conference covering just about any SharePoint 2010 topic you can imagine – from the geekiest developer session to business tracks with lessons learned.

To me, one of the most memorable sessions was Social Search with Dan Benson and Paul Summers, in which they showed us how social behaviours can be used to influence the ranking of search results. For instance, users’ interests entered in MySite can be used to boost (xrank) search results accordingly. This was an eye opener, as it illustrated what’s possible with quite simple means. Thanks for that!

Another great session was Scott Jamison talking about Findability in SharePoint. The key ingredient in this session was to differentiate between contributor and consumer. Typically we focus on the contributor, building 100-level folder structures with names that make sense to the contributor. However, we seem to forget about the consumers, who of course are the other key aspect of an intranet. It is as important to focus on consumer needs as it is to create a good support system for contributors. As Jamison said, “why have folders for both contributors and consumers?”. SharePoint offers endless possibilities when it comes to creating logical views built on search, tags and filtering, aimed at meeting the needs of the consumers.

So, keep the folders or whatever support the contributor needs, but let your imagination run free when delivering best-in-class Findability to the consumer!

Distributed processing + search == true?

September 30 - 2011 | Anders Rask

In June 2011, I attended the Berlin Buzzwords conference. The main theme of the conference was undoubtedly the current paradigm shift in distributed processing, driven by the major success of Hadoop. Doug Cutting – founder of Apache projects such as Lucene, Nutch and Hadoop – held one of the keynotes. He focused on what he recognized as the new foundations for this paradigm shift:

- Commodity hardware
- Sequential file access
- Sharding
- Automated, high level reliability
- Open source

Distributed processing is done fairly well with Hadoop. Distributed search, on the other hand, is more or less limited to sharding and/or replicating the index. The downside of sharding is that you perform the same search on multiple servers and then need to combine the results. Because search algorithms such as tf/idf depend on statistics from the whole index, tasks like ranking results suffer. Andrzej Białecki (another frequent Lucene committer) held a presentation on this topic, and his view can be summarized as: use local search as long as you can, and distribute only when the cost of local search limitations outweighs the cost of distributed search.
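Why ranking suffers can be shown with a little arithmetic. The sketch below assumes a classic Lucene-style idf of 1 + ln(numDocs / (docFreq + 1)) and uses invented document counts for two shards; it is an illustration of the problem, not of how any particular engine merges results.

```java
class ShardIdfDemo {
    // Assumed formula (classic Lucene-style idf): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Invented statistics for one term on two shards of 1000 documents each.
        long docsPerShard = 1000;
        long docFreqShardA = 500; // term is common on shard A
        long docFreqShardB = 5;   // term is rare on shard B

        double idfA = idf(docFreqShardA, docsPerShard);
        double idfB = idf(docFreqShardB, docsPerShard);
        double idfGlobal = idf(docFreqShardA + docFreqShardB, 2 * docsPerShard);

        // Each shard weights the same term differently, so scores computed
        // locally are not directly comparable when the results are merged.
        System.out.printf("shard A idf: %.3f%n", idfA);
        System.out.printf("shard B idf: %.3f%n", idfB);
        System.out.printf("global idf:  %.3f%n", idfGlobal);
    }
}
```

With these numbers the term counts for much more on shard B than on shard A, so a mediocre match from shard B can outrank a better match from shard A unless global statistics are used.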

The setup of automated replication and sharding, with help from ZooKeeper in the SolrCloud project, is a major step in the right direction, but the question of how to properly combine search results from different nodes still remains. One thing is sure though: there is a lot of interesting work being done in this area.

Enterprise search – market overview 2011

September 26 - 2011 | Caroline Abrahamsson

A few weeks ago Forrester Research released a report with an overview of the 12 leading enterprise search vendors on the global market (Attivio, Autonomy, Coveo, Endeca, Exalead, Fabasoft, Google, IBM, ISYS Search, Microsoft, Sinequa and Vivisimo).

When I wrote about the Gartner report, readers commented on the fact that open source solutions were not part of the scope, even though their market share is increasing rapidly. The Forrester report has the same approach, except it includes vendors offering their products stand-alone as well as those with products integrated in portal/ECM solutions.

So why the exclusion of open source? Well, it appears difficult to decide on how to evaluate open source, especially when it comes to more advanced appliances.

Looking at the Forrester report, it includes some familiar conclusions but also a few new insights. Leslie Owen from Forrester concludes that “Google, Autonomy, and Microsoft are the most well-known names; they own a large portion of the existing market”. Hence, these vendors are still standing strong, even though they are challenged in various areas.

More surprisingly, some niche players get higher scores than the giants in core areas such as “Indexing and connectivity”, “Interface flexibility” and “Social and collaborative features”.

Vivisimo is seen as somewhat of a leader (with a slightly lower score on Mobile support and Semantics/text analysis). In the Gartner report, Vivisimo was excluded from the information access evaluation due to the fact that they were ”focusing on specialized application categories, such as customer service”.

Search vendor overview

An interesting reflection from Forrester is that “in the next few years, we expect prices to rise as specialized vendors wax poetic on the transformative power of search in order to distinguish their products from Google and Microsoft FAST Search for SharePoint”. On the Nordic market, we have not seen a shift to such a strategy, but rather the opposite, since open source (with zero license fees) is becoming accepted in an Enterprise environment to a larger extent.

The vendors that provide integrated solutions (to CMS/WCM etc.) still remain strong, whereas the stand-alone solutions are becoming exposed to competition in new ways. It will be interesting to follow the US and Nordic markets to see how this evolves within the next year. It might be that the markets differ when it comes to open source adoption.

If you wish to read the full report, it can be downloaded from Vivisimo through a simple registration.
To get a complete overview of vendors, I recommend reading both the Gartner and Forrester reports.
