<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Solr Processing Pipeline</title>
	<atom:link href="http://findabilityblog.se/solr-processing-pipeline/feed/" rel="self" type="application/rss+xml" />
	<link>http://findabilityblog.se/solr-processing-pipeline/</link>
	<description>the search and findability blog</description>
	<lastBuildDate>Sun, 29 Jan 2012 11:05:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Processing pipeline for the Google Search Appliance &#171; The Findability blog</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-28</link>
		<dc:creator>Processing pipeline for the Google Search Appliance &#171; The Findability blog</dc:creator>
		<pubDate>Fri, 08 Oct 2010 09:24:45 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-28</guid>
		<description>[...]  October 8 - 2010 &#124;  Tobias Larsson Hult    Max has previously highlighted the subject of a processing pipeline for Apache Solr. Another enterprise search engine that is lacking this feature is the Google Search Appliance [...] </description>
		<content:encoded><![CDATA[<p>[...]  October 8 &#8211; 2010 |  Tobias Larsson Hult    Max has previously highlighted the subject of a processing pipeline for Apache Solr. Another enterprise search engine that is lacking this feature is the Google Search Appliance [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fredrik Rødland</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-27</link>
		<dc:creator>Fredrik Rødland</dc:creator>
		<pubDate>Thu, 20 May 2010 14:27:35 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-27</guid>
		<description>Hi Max.

Thanks for an interesting talk about this topic at Eurocon. Integrating the pipeline with UpdateRequestProcessorChain, as Jan mentions, is actually what I was thinking of talking to you about after the talk.  Also, for simple needs - why not just implement a (few) stage(s) directly as processors in this chain (subclassing UpdateRequestProcessorFactory) and configure them in (a possibly separate chain in) solrconfig.xml?</description>
		<content:encoded><![CDATA[<p>Hi Max.</p>
<p>Thanks for an interesting talk about this topic at Eurocon. Integrating the pipeline with UpdateRequestProcessorChain, as Jan mentions, is actually what I was thinking of talking to you about after the talk.  Also, for simple needs &#8211; why not just implement a (few) stage(s) directly as processors in this chain (subclassing UpdateRequestProcessorFactory) and configure them in (a possibly separate chain in) solrconfig.xml?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jan Høydahl</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-26</link>
		<dc:creator>Jan Høydahl</dc:creator>
		<pubDate>Tue, 11 May 2010 18:32:06 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-26</guid>
		<description>Good to see you continue the thinking around this important topic, Max.

I looked at commons pipeline, and what it does particularly well is scalability in terms of the individual processing stages - each stage can have its own thread! - and queuing between stages etc. Also the parallel nature allows for optimized utilization for heavy processing, although for many search use cases order is important so I don&#039;t know about the usefulness of this.

Carl M: I did not know about SMILA. Impressive. However, it feels way too heavy for what the majority of Solr installations will need. I foresee SMILA being useful only for the largest integrations where they are already using BPEL, not for the simple processing needs of a local search application. Also it smells a bit over-engineered?

Back to your suggestion, Max. If the pipeline is to live strictly outside of Solr it makes sense to speak the Solr API at both ends. However, you then short-circuit all other UpdateRequestHandlers from benefiting from the pipeline. DIH, CSV, Extracting, (Java)Binary and other handlers cannot use it.

To overcome this, instead of integrating the pipeline as a standalone service, integrate it into Solr&#039;s UpdateRequestProcessorChain, which sits between the RequestHandlers and indexing. There could be two versions of the pipeline factory - one Local, which executes the pipeline in the same thread, and one Remote, which streams the documents to a dedicated processing node/cluster.

I think this plays better with the current and future Solr architecture because
* The pipeline will be truly transparent, and ALL current RequestHandlers can be used, including DIH and LCF
* Solr will get built-in shard routing logic, obeying the new concepts of collections etc from SolrCloud
* It makes sense to have the choice of running a light-weight single-node without unnecessary HTTP calls, but also have the possibility of scaling out

Another issue is that the REST document model and SolrInputDocument are not (currently) rich enough to hold metadata about a partly processed document. I think it is unavoidable at some point to at least do tokenization in the pipeline. Then we need to pass on a document with both the original version of the field and the tokenized version, with metadata indicating that it is already tokenized. The Lucene analysis chain could then skip tokenization. This is the way OpenPipe (http://openpipe.berlios.de/) is integrated with Solr. They found the need to invent a custom binary protocol to convey such metadata...

I really like the idea of using commons-pipeline, keep up the thinking :)</description>
		<content:encoded><![CDATA[<p>Good to see you continue the thinking around this important topic, Max.</p>
<p>I looked at commons pipeline, and what it does particularly well is scalability in terms of the individual processing stages &#8211; each stage can have its own thread! &#8211; and queuing between stages etc. Also the parallel nature allows for optimized utilization for heavy processing, although for many search use cases order is important so I don&#8217;t know about the usefulness of this.</p>
<p>Carl M: I did not know about SMILA. Impressive. However, it feels way too heavy for what the majority of Solr installations will need. I foresee SMILA being useful only for the largest integrations where they are already using BPEL, not for the simple processing needs of a local search application. Also it smells a bit over-engineered?</p>
<p>Back to your suggestion, Max. If the pipeline is to live strictly outside of Solr it makes sense to speak the Solr API at both ends. However, you then short-circuit all other UpdateRequestHandlers from benefiting from the pipeline. DIH, CSV, Extracting, (Java)Binary and other handlers cannot use it.</p>
<p>To overcome this, instead of integrating the pipeline as a standalone service, integrate it into Solr&#8217;s UpdateRequestProcessorChain, which sits between the RequestHandlers and indexing. There could be two versions of the pipeline factory &#8211; one Local, which executes the pipeline in the same thread, and one Remote, which streams the documents to a dedicated processing node/cluster.</p>
<p>I think this plays better with the current and future Solr architecture because<br />
* The pipeline will be truly transparent, and ALL current RequestHandlers can be used, including DIH and LCF<br />
* Solr will get built-in shard routing logic, obeying the new concepts of collections etc from SolrCloud<br />
* It makes sense to have the choice of running a light-weight single-node without unnecessary HTTP calls, but also have the possibility of scaling out</p>
<p>Another issue is that the REST document model and SolrInputDocument are not (currently) rich enough to hold metadata about a partly processed document. I think it is unavoidable at some point to at least do tokenization in the pipeline. Then we need to pass on a document with both the original version of the field and the tokenized version, with metadata indicating that it is already tokenized. The Lucene analysis chain could then skip tokenization. This is the way OpenPipe (<a href="http://openpipe.berlios.de/" rel="nofollow">http://openpipe.berlios.de/</a>) is integrated with Solr. They found the need to invent a custom binary protocol to convey such metadata&#8230;</p>
<p>I really like the idea of using commons-pipeline, keep up the thinking <img src='http://findabilityblog.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hannes Carl Meyer</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-25</link>
		<dc:creator>Hannes Carl Meyer</dc:creator>
		<pubDate>Wed, 21 Apr 2010 16:13:14 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-25</guid>
		<description>The whole picture of acquiring information (e.g. a web crawler), analysis, and indexing is currently covered by an Eclipse incubating project called SMILA: http://www.eclipse.org/smila/
SMILA means SeMantic Information Logistics Architecture - which is a pretty neat idea but hard to realize. For the plug-in system of components for analysis (categorization, language detection etc.) they rely on OSGi which makes it possible to change components and the behaviour during runtime.

Currently under (heavy?) development by two companies from Germany (Empolis, Brox).</description>
		<content:encoded><![CDATA[<p>The whole picture of acquiring information (e.g. a web crawler), analysis, and indexing is currently covered by an Eclipse incubating project called SMILA: <a href="http://www.eclipse.org/smila/" rel="nofollow">http://www.eclipse.org/smila/</a><br />
SMILA means SeMantic Information Logistics Architecture &#8211; which is a pretty neat idea but hard to realize. For the plug-in system of components for analysis (categorization, language detection etc.) they rely on OSGi which makes it possible to change components and the behaviour during runtime.</p>
<p>Currently under (heavy?) development by two companies from Germany (Empolis, Brox).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Max Charas</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-24</link>
		<dc:creator>Max Charas</dc:creator>
		<pubDate>Wed, 21 Apr 2010 12:54:30 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-24</guid>
		<description>Hi Eric,
Yeah, that’s kind of my vision. The LCF is a place to build stable connectors and the Pipeline is a playground to share small tidbits of code to manipulate your data. Stuff like:

- Categorization.
- Language detection.
- Different language processing depending on the language detection.
- And so much more.

When it comes to the DIH, I’m no fan of it. I really have no reason for not using it, but it just seems complicated.

And yes, it’s another moving part of Solr, but (!) highly optional. This type of product would only be used for applications where you need to manipulate your data a lot: for example, large enterprise applications, wikis, library content, etc.</description>
		<content:encoded><![CDATA[<p>Hi Eric,<br />
Yeah, that’s kind of my vision. The LCF is a place to build stable connectors and the Pipeline is a playground to share small tidbits of code to manipulate your data. Stuff like:</p>
<p>- Categorization.<br />
- Language detection.<br />
- Different language processing depending on the language detection.<br />
- And so much more.</p>
<p>When it comes to the DIH, I’m no fan of it. I really have no reason for not using it, but it just seems complicated.</p>
<p>And yes, it’s another moving part of Solr, but (!) highly optional. This type of product would only be used for applications where you need to manipulate your data a lot: for example, large enterprise applications, wikis, library content, etc.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Pugh</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-23</link>
		<dc:creator>Eric Pugh</dc:creator>
		<pubDate>Wed, 21 Apr 2010 12:20:19 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-23</guid>
		<description>I can see some pros and cons of your idea

Pros:
 - Speaking Solr at both ends makes it much easier to play with.  One of the things I love about Solr is how simple it is to start using it.  No need to set up crazy message queues, complex documents, etc.  Just pop a document in.  So this would lower the barrier to entry.  
 - Give a hook into what is going into Solr without changing Solr itself.  I can see putting this as a frontend and just logging/recording what is going through.  Most of us probably can not easily tell &quot;what docs changed in the last 15 minutes&quot; for example.  
- Poor man&#039;s ESB...   Logic such as &quot;when I index a document of type X, also trigger this other job&quot; can go there.
- Maybe a simpler way to share processing steps by fitting into a standard pipeline?

Cons:
 - Doesn&#039;t this to some extent duplicate the index time pipeline inside of Solr for various document types?  
 - Another moving part for Solr?
 - Are you duplicating a lot of what DIH does in terms of processing logic?  Although, I actually don&#039;t really like the DIH approach.  It seems like it solves a specific issue well, but then the more processing you do, the more you are writing code/logic using complex XML!

Something about your proposal sounded familiar, so I looked back at some projects: a couple of years ago I built a data processing pipeline using Apache Commons Chain.  An XML doc defined the pipeline and any specific values per step, and then you could have multiple pipelines.  The ability to have multiple ones and try one versus the other was great.  Love to see where this goes.  Maybe replace DIH&#039;s processing layer?

I know every time I write an indexer for a new project, I often write some of the same code over and over!</description>
		<content:encoded><![CDATA[<p>I can see some pros and cons of your idea</p>
<p>Pros:<br />
 &#8211; Speaking Solr at both ends makes it much easier to play with.  One of the things I love about Solr is how simple it is to start using it.  No need to set up crazy message queues, complex documents, etc.  Just pop a document in.  So this would lower the barrier to entry.<br />
 &#8211; Give a hook into what is going into Solr without changing Solr itself.  I can see putting this as a frontend and just logging/recording what is going through.  Most of us probably can not easily tell &#8220;what docs changed in the last 15 minutes&#8221; for example.<br />
- Poor man&#8217;s ESB&#8230;   Logic such as &#8220;when I index a document of type X, also trigger this other job&#8221; can go there.<br />
- Maybe a simpler way to share processing steps by fitting into a standard pipeline?</p>
<p>Cons:<br />
 &#8211; Doesn&#8217;t this to some extent duplicate the index time pipeline inside of Solr for various document types?<br />
 &#8211; Another moving part for Solr?<br />
 &#8211; Are you duplicating a lot of what DIH does in terms of processing logic?  Although, I actually don&#8217;t really like the DIH approach.  It seems like it solves a specific issue well, but then the more processing you do, the more you are writing code/logic using complex XML!</p>
<p>Something about your proposal sounded familiar, so I looked back at some projects: a couple of years ago I built a data processing pipeline using Apache Commons Chain.  An XML doc defined the pipeline and any specific values per step, and then you could have multiple pipelines.  The ability to have multiple ones and try one versus the other was great.  Love to see where this goes.  Maybe replace DIH&#8217;s processing layer?</p>
<p>I know every time I write an indexer for a new project, I often write some of the same code over and over!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Max Charas</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-22</link>
		<dc:creator>Max Charas</dc:creator>
		<pubDate>Wed, 21 Apr 2010 09:27:36 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-22</guid>
		<description>Hannes,
Unfortunately I haven’t worked with UIMA yet, but I’ve heard a lot about it. When it comes to hot plugging new steps at runtime I totally agree with you. I’ve worked on another big enterprise project where we developed a Pipeline framework with a Domain Specific Language that was hot pluggable, which came in handy a lot.</description>
		<content:encoded><![CDATA[<p>Hannes,<br />
Unfortunately I haven’t worked with UIMA yet, but I’ve heard a lot about it. When it comes to hot plugging new steps at runtime I totally agree with you. I’ve worked on another big enterprise project where we developed a Pipeline framework with a Domain Specific Language that was hot pluggable, which came in handy a lot.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Max Charas</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-21</link>
		<dc:creator>Max Charas</dc:creator>
		<pubDate>Wed, 21 Apr 2010 09:25:22 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-21</guid>
		<description>Otis, you are of course right. On the LCF end one would use the API that fits LCF best.

But I think that most people would have their own &quot;hacked&quot; connectors that speak “Solr-REST”. And in that case it would be nice to have a pipeline that acts as Solr on both sides: a more or less totally transparent pipeline.</description>
		<content:encoded><![CDATA[<p>Otis, you are of course right. On the LCF end one would use the API that fits LCF best.</p>
<p>But I think that most people would have their own &#8220;hacked&#8221; connectors that speak “Solr-REST”. And in that case it would be nice to have a pipeline that acts as Solr on both sides: a more or less totally transparent pipeline.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hannes Carl Meyer</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-20</link>
		<dc:creator>Hannes Carl Meyer</dc:creator>
		<pubDate>Tue, 20 Apr 2010 12:47:57 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-20</guid>
		<description>Just forgot about the main feature/requirement for such systems: changing the behaviour of the processing pipeline during runtime.</description>
		<content:encoded><![CDATA[<p>Just forgot about the main feature/requirement for such systems: changing the behaviour of the processing pipeline during runtime.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Otis Gospodnetic</title>
		<link>http://findabilityblog.se/solr-processing-pipeline/#comment-19</link>
		<dc:creator>Otis Gospodnetic</dc:creator>
		<pubDate>Mon, 19 Apr 2010 19:29:25 +0000</pubDate>
		<guid isPermaLink="false">http://findabilityblog.se/?p=1952#comment-19</guid>
		<description>LCF is pretty new, so give it a bit of time...
SolrJ on the LCF end seems a little funky.  That would mean LCF REST API would have to at least use the same response format?  Would that make sense for LCF?  I guess it would....

Let&#039;s see what others say:
http://search-lucene.com/m/B1LSk1RL4BN</description>
		<content:encoded><![CDATA[<p>LCF is pretty new, so give it a bit of time&#8230;<br />
SolrJ on the LCF end seems a little funky.  That would mean LCF REST API would have to at least use the same response format?  Would that make sense for LCF?  I guess it would&#8230;.</p>
<p>Let&#8217;s see what others say:<br />
<a href="http://search-lucene.com/m/B1LSk1RL4BN" rel="nofollow">http://search-lucene.com/m/B1LSk1RL4BN</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

