Search and text tools for umbraco
RyanRoberts
Posted: Monday, May 21, 2007 8:50:46 PM
Rank: Enthusiast

Joined: 7/25/2006
Posts: 14
Fellow Umbracans.

I have de-enterprised, upgraded to 2.0 and open-sourced a project (which I chickened out of demonstrating at Codegarden). It lets you do some pretty powerful things that involve retrieving and relating Umbraco content and metadata, with a minimal amount of effort and, hopefully, high performance.

Get it here.

XSLT Lucene search extension

The core of the system is an extended (well, rewritten) Lucene query language hosted in an XSLT extension for easy integration.

Installation

Copy the binary files and 'DefaultStopWords.txt' to the bin directory of your umbraco installation.
Add the following line into the xsltExtensions element in config/xsltExtensions.config:
<ext assembly="/bin/SearchTools" type="SearchTools.Xsl" alias="stk" />

Usage

To use the search extensions from XSLT, first add xmlns:stk="urn:stk" to your xsl:stylesheet element. You should then be able to call the provided Search method from within your XSLT.

The Search method takes two parameters: a query string and a result limit. It returns a nodeset containing a result for each Umbraco node that matched the query.
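As a quick check that the extension is wired up, a minimal stylesheet might look like the sketch below. It assumes the 'stk' alias registered during installation; the query 'contents:project' and the limit of 10 are placeholder values, not anything the package requires.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the stk alias from xsltExtensions.config resolves to urn:stk -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:stk="urn:stk"
    exclude-result-prefixes="stk">

    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <!-- First argument: query string. Second argument: result limit. -->
        <!-- copy-of dumps the raw nodeset so you can inspect what came back -->
        <xsl:copy-of select="stk:Search('contents:project', 10)"/>
    </xsl:template>

</xsl:stylesheet>
```

Dumping the raw nodeset first makes it easier to see which field names your index actually contains before writing templates against them.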

Sample result set

The result set returned looks like this:

Code:

<root>
    <Result>
        <Score>0.23332</Score>
        <Field name="PageName">A result page</Field>
        <Field name="ADoctypeField">Itsvalue</Field>
        <Field name="AnotherDoctypeField">Itsvalue</Field>
    </Result>
</root>



Sample XSLT - fixed search for terms

The following XSLT searches the site index for the terms 'project' and 'management', returning up to 50 results in descending order of suitability. The first parameter to the Search method is the query string; download the source and have a look at the test suite for an idea of what it should be capable of. Any feedback on the syntax would be appreciated at this early stage, while it can still be changed easily. For one thing, I'm starting to doubt my choice of '&&' as the AND operator when the queries are mainly going to be written inside XML attributes.

Simple queries take the form of fieldName:term or fieldName:"composite term", where the field name corresponds to a field in your site's Lucene index.

Code:

<xsl:template match="/">
    <ul>
        <xsl:for-each select="stk:Search('contents:project contents:management',50)">
            <li>
                <xsl:apply-templates/>
            </li>
        </xsl:for-each>
    </ul>
</xsl:template>

<xsl:template match="Result">
    <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Score">
    (<xsl:value-of select="./text()"/>)
</xsl:template>

<xsl:template match="Field">
    <xsl:if test="./@name = 'PageName'">
        <xsl:value-of select="./text()"/>
    </xsl:if>
</xsl:template>


So far, so not very interesting, though it might be useful in scenarios where you want to use metadata tags to have your pages appear in multiple sections of a site. I have also added features that link Umbraco content directly with the Lucene index as part of your queries.

Sample query - link page fields

#injectMeta(PageUnstructured:tags,content)

This query will take the contents of the current page's 'tags' field and transform it into a query against the 'content' field, so it should return a ranked list of all the pages in your site that contain the page's keywords in their content field.
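A "related pages" list built on that query might be sketched as follows. This assumes the result set has the shape shown in the sample above (Result elements with Score and Field children) and that 'PageName' is present in your index; the limit of 10 is arbitrary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: field names and result shape are taken from the sample result set -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:stk="urn:stk"
    exclude-result-prefixes="stk">

    <xsl:output method="html"/>

    <xsl:template match="/">
        <!-- Assumption: 10 is an arbitrary limit for a related-pages box -->
        <xsl:variable name="related"
            select="stk:Search('#injectMeta(PageUnstructured:tags,content)', 10)"/>
        <ul>
            <!-- descendant-or-self copes with Result elements being either the
                 returned nodes themselves or nested under a root element -->
            <xsl:for-each select="$related/descendant-or-self::Result">
                <li>
                    <xsl:value-of select="Field[@name = 'PageName']"/>
                    <xsl:text> (</xsl:text>
                    <xsl:value-of select="Score"/>
                    <xsl:text>)</xsl:text>
                </li>
            </xsl:for-each>
        </ul>
    </xsl:template>

</xsl:stylesheet>
```

Selecting the fields by name, rather than relying on apply-templates, keeps the output limited to the page name and its score.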

Sample query - link member field

#injectMeta(MemberUnstructured:tags,content)

This query will take the contents of the current member's 'tags' field and transform it into a query against the 'content' field.

Sample query - composite metadata and boosting

The query language is pretty expressive - you can combine expressions with boolean logic, boost components of the expression and negate terms.

#injectMeta(PageUnstructured:tags,content)^3 && PageType:NewsItem && !(PageName:Introduction)

This should be a completely valid query in which the metadata injection is boosted by 3x in the ranking, the PageType is required to be 'NewsItem' and the page name must not be 'Introduction'.
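When a query like this is written inside a stylesheet, each '&' has to be escaped as '&amp;amp;' or the XSLT is no longer well-formed XML. The sketch below shows one way to do it, holding the query in a variable so the escaping stays in one place; the variable name and the limit of 50 are illustrative choices only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the same composite query, with && escaped for XML -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:stk="urn:stk"
    exclude-result-prefixes="stk">

    <xsl:output method="html"/>

    <xsl:template match="/">
        <!-- The parser turns each &amp; back into & before Search sees the string -->
        <xsl:variable name="query">#injectMeta(PageUnstructured:tags,content)^3 &amp;&amp; PageType:NewsItem &amp;&amp; !(PageName:Introduction)</xsl:variable>
        <ul>
            <!-- string() converts the variable's result tree fragment to a plain string -->
            <xsl:for-each select="stk:Search(string($query), 50)">
                <li>
                    <xsl:apply-templates/>
                </li>
            </xsl:for-each>
        </ul>
    </xsl:template>

</xsl:stylesheet>
```

The same escaping applies if the query is written directly inside a select attribute.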

Utility method - Bayesian summariser

I have also added the simple Bayesian summariser from the NClassifier project. This can be used to automatically determine the most significant sentences in a text - handy for the automatic production of teaser text.

Quote:

Being able to take a look at the words and phrases people use when looking for things online is invaluable. Rather than listening to people say what they think they might do, you get to observe what they actually did. And when aggregated, you get a nice view of the words people most often use when thinking about and searching for a certain topic.Once armed with keyword intelligence that’s relevant to your niche, you have the unique ability to create highly-relevant content that aids your site visitors and enhances your credibility. You’re speaking the language of the audience after all, and satisfying their needs.And if you get it right, you’ll likely rank well in the search engines too, after promoting the content in a strategic way. It may seem strange to view search traffic as a secondary benefit in a Google-driven world, but that’s exactly how you should view it. Google won’t treat you as relevant until someone else does first.


Will become:

Quote:

Being able to take a look at the words and phrases people use when looking for things online is invaluable. And when aggregated, you get a nice view of the words people most often use when thinking about and searching for a certain topic.Once armed with keyword intelligence that’s relevant to your niche, you have the unique ability to create highly-relevant content that aids your site visitors and enhances your credibility.


When summarised to 2 sentences.

Anybody who has feedback or ideas for this project or who wishes to contribute, please get in touch - ryansroberts at gmail dotcom.


jesper
Posted: Tuesday, May 22, 2007 4:27:52 PM

Rank: Administration

Joined: 7/25/2006
Posts: 426
Location: vipperoed, denmark
The feature list sounds very cool and the implementation is very cool (using xslt). What are the requirements?

Kindly,

Jesper

ps. I've tried with umbraco 2.15 running 2.0 without any luck

webbureau jesper.com doing webdesign / development / umbraco implementations / 2007&2008 MVP / umbraco certified
RyanRoberts
Posted: Tuesday, May 22, 2007 7:41:53 PM
Rank: Enthusiast

Joined: 7/25/2006
Posts: 14
I've only tried it in 3.0 RC. I'll grab a 2.1 release to try it in. Apologies for the breakage; could you send me an exception trace if you have one?
RyanRoberts
Posted: Tuesday, May 22, 2007 8:29:02 PM
Rank: Enthusiast

Joined: 7/25/2006
Posts: 14
Ah, and another potential problem is that only a limited number of fields are indexed in Umbraco by default: Content, Text, Id, User, CreateDate, SortText. I forgot that I used a modified version to get additional fields :blush:

So only these can be used as the target field of a search. You should still be able to use any field as a source of keywords however.

Searches like 'Content:google' and 'PageUnstructured:keywords,Content' should work fine.

You can use Luke to have a look at the index that Umbraco has produced in data/_systemUmbracoIndexDontDelete.

I'll have to chat to Niels about allowing additional document type fields to be indexed, as that allows you to do far nicer things with metadata than just using an aggregate content index. He's a little busy at the moment though I guess :)
jesper
Posted: Tuesday, May 22, 2007 11:30:12 PM

Rank: Administration

Joined: 7/25/2006
Posts: 426
Location: vipperoed, denmark
Ryan Roberts wrote:

I've only tried it in 3.0 RC. I'll grab a 2.1 release to try it in. Apologies for the breakage; could you send me an exception trace if you have one?

I'll check if there's an exception tomorrow. But do I need to have Lucene preinstalled, or does your package include it?

Kindly,
Jesper

RyanRoberts
Posted: Tuesday, May 22, 2007 11:34:56 PM
Rank: Enthusiast

Joined: 7/25/2006
Posts: 14
The binary currently has lucene 2.0 ilmerged into it, so as not to clash with Umbraco's lucene dependency. You should only need the SearchTools assembly.
jesper
Posted: Wednesday, May 23, 2007 2:05:38 PM

Rank: Administration

Joined: 7/25/2006
Posts: 426
Location: vipperoed, denmark
Ryan Roberts wrote:

The binary currently has lucene 2.0 ilmerged into it, so as not to clash with Umbraco's lucene dependency. You should only need the SearchTools assembly.


Hi Ryan,

I installed Umbraco 3 RC and the websitewizard, and finally copied your dll and stopwords file into bin.

Now I've created a simple XSLT that works:

Code:

<xsl:copy-of select="stk:Search('Text:products', 201)"/>


This gives me one hit:

Code:

<Result>
    <Score>0,9999999</Score>
    <Field name="Id">1054</Field>
    <Field name="Text">Products</Field>
    <Field name="ObjectType">c66ba18e-eaf3-4cff-8a22-41b16d66a972</Field>
    <Field name="User">umbraco_system</Field>
    <Field name="CreateDate">20070523</Field>
    <Field name="SortText">Products</Field>
    <Field name="Content">Products </Field>
</Result>


So far so good. But I can't get it to work as you described: I can only make it hit on page names, not on content. I've tried with Content:products and get the same hit, but there should have been more.

When I use Luke to examine the index, it seems that Content, Text and SortText all contain the same value (the page name).

Kindly,
Jesper

RyanRoberts
Posted: Wednesday, May 23, 2007 3:20:01 PM
Rank: Enthusiast

Joined: 7/25/2006
Posts: 14
Very odd, I'll check the indexes produced by the codeplex RC when I get home.
imayat12
Posted: Wednesday, January 23, 2008 3:37:33 PM

Rank: Addict

Joined: 7/19/2006
Posts: 672
Location: Preston, UK
Jesper,

I have got this working. I had to submit a patch. Ryan was looking at the internal index and not the external index that is created by umbracoUtilities / umbSearch. Get the patch from CodePlex and try again.

I reckon this is good enough for an MVP08 nomination for Ryan. I'm having a bit more of a play with queries and I've got to say I'm totally impressed :D

Ismail

Level 2 certified. If it aint broke dont fix.
seb
Posted: Thursday, January 31, 2008 4:27:04 PM

Rank: Devotee

Joined: 1/10/2008
Posts: 78
Location: London
Hi,

I'm trying to make this work but I'm stuck.
Ismail, I got your latest patch with the latest release from Ryan, but I'm just not able to compile it. Can you make your dll available somehow?
Also, I tried to use Luke to see how my content was indexed.
The content seems pretty much fine but under the "Documents" tab, I have nothing. I have created some documents in the media library and created links to them on some pages, but their content doesn't seem to be indexed. Am I right assuming that I should see some of my media under the "Documents" tab in Luke?

Cheers,
Seb

http://www.be-k.net
imayat12
Posted: Thursday, January 31, 2008 5:08:04 PM

Rank: Addict

Joined: 7/19/2006
Posts: 672
Location: Preston, UK
Sebastian,

You can get my bin here. Do you have Umbraco search from CodePlex installed, i.e. the indexer Niels wrote? This will index your nodes and media nodes. Creating links, content or media on its own will not get them into the index; you have to install the indexer first. Also, do you have an index directory called umbSearch in your Umbraco data directory? That is where the index will be created.

If you need to look at my index, look on CodePlex; I have uploaded a dummy one for testing.

One thing to note: the injectMeta stuff does not work yet, but no doubt Ryan will sort it at some point, work commitments permitting :yes:

Regards

Ismail
