XSLT Search returns strange data. Is it an å,ä,ö problem?
kalpa
Posted: Wednesday, November 21, 2007 10:00:39 PM

Rank: Fanatic

Joined: 7/19/2006
Posts: 496
Location: Göteborg, Sweden
Hi!

I just implemented the (great) XSLT Search on a site I'm working on.

If I search for a word containing the letters å, ä or ö, it returns only one result, but if I search for the same word and omit the last part, from the "ö" onwards, I get 13 matches?

See for yourself here (in progress): http://www.soms.nu/sok.aspx
This is a Swedish site for a non-profit association for physical therapists who work with people with different disabilities.

Words to test with might be:
(learning disability) "utvecklingsstörning" vs. "utvecklingsst" (1 vs. 13 matches)
(network) "nätverk" vs. "tverk" (first part omitted) (1 vs. 5 matches)

Obviously XSLT Search finds the first word, but for some reason it doesn't show the rest...

// ;) Kalle



" - Yeah I'd like to share your point of view, as long as it's my view too... (http://www.d-a-d.dk/lyrics/pointofview)
drobar
Posted: Wednesday, November 21, 2007 11:42:16 PM

Rank: Umbracoholic

Joined: 9/8/2006
Posts: 1,831
Location: MA, USA
Is this related to this post? http://forum.umbraco.org/17461

View the source of the pages that are and aren't returned and see if some are URL-escaped and some are not.

Let me know.

cheers,
doug.

MVP 2007-2009 - Percipient Studios
kalpa
Posted: Thursday, November 22, 2007 12:01:18 AM

Yes, that's the same "feature" it seems; the page that actually was found contained unescaped chars somehow...

I think I'll try Mr Bock's little hack, unless 2.7 is just around the corner that is ; )

// ; ) Kalle

PS. Thanks for pointing me to that thread, I totally missed that one :blush:
tkahn
Posted: Thursday, November 22, 2007 8:56:44 PM

Rank: Fanatic

Joined: 11/24/2006
Posts: 323
Location: Stockholm, Sweden
I'm using XSLTSearch on a Swedish site and I'm not experiencing this problem. Test it at:

http://www.itresurs.se/sok

Perhaps it has something to do with the globalization settings in web.config? I have set all mine to UTF-8, like this:

<globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="utf-8"/>

Still, I see inconsistencies in how the characters in the results are encoded: sometimes none of the Swedish characters (å, ä, ö) in the search results are entity-encoded, sometimes only some of them are, and sometimes they all are. I don't know why, though...



Web Developer at Kärnhuset - http://www.karnhuset.net - Stockholm, Sweden
kalpa
Posted: Thursday, November 22, 2007 9:56:38 PM

@Thomas: I'm sorry to say you actually do have the same issue there...

If you search first for "konsulttjänster" and then for "konsulttj", you'll notice that the latter returns 4 matches while the former returns 2 matches...

And you can see that the first 2 results are for a match in the headline, which (I guess) is probably a plain text string, while the word "konsulttjänster" in the body doesn't get matched at all...

This sounds somehow like the same kind of problem...
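
To illustrate the suspected cause, here is a minimal sketch. It is an assumption about what is going on, not code taken from XSLT Search: a plain substring comparison finds the word in a headline stored as plain text, but misses it in body text where the editor stored "ä" as the entity &auml;. The class name, the helper and the sample strings are all made up for the example.

```csharp
using System;

public static class EntityMismatchDemo
{
    // Hypothetical stand-in for whatever comparison the search performs:
    // a case-insensitive substring test.
    public static bool Contains(string haystack, string term)
    {
        return haystack.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        string headline = "Konsulttjänster";                      // stored as plain text
        string body = "<p>Vi erbjuder konsulttj&auml;nster.</p>"; // "ä" stored as an entity

        Console.WriteLine(Contains(headline, "konsulttjänster")); // True
        Console.WriteLine(Contains(body, "konsulttjänster"));     // False
        Console.WriteLine(Contains(body, "konsulttj"));           // True
    }
}
```

That would explain why the shorter term, cut off before the "ä", returns more hits than the full word.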

A question: I noticed that the latest news item contained some paragraphs with the class attribute "MsoNormal". Is this a class defined by you, or does it come from the editor somehow?
(Cut'n'paste from Word by the client, perhaps?)

// ; ) Kalle
tkahn
Posted: Friday, November 23, 2007 9:24:28 AM

MsoNormal is something that comes from Microsoft Word, and it's not removed by the code cleaner in Umbraco. It would be nice if it were removed, but I don't know if I can configure Umbraco's HTML editor to do that?

Still, that is a sidetrack. The main issue here is how to make XSLT Search handle Scandinavian characters better. Perhaps this could be a task for the Scandinavian users of Umbraco? I'm up to my neck in work for clients right now, but I could take a closer look at the code if I get some time.

tkahn
Posted: Friday, November 23, 2007 9:54:33 AM

Looking through the code, perhaps an idea would be to extend the following functions to cover the Scandinavian characters as well:

Code:

    public string escapeSearchTerms(string data)
    {
        return data.Replace(Convert.ToString((char)38), "&amp;");
    }

    public string unescapeSearchTerms(string data)
    {
        return data.Replace("&amp;", Convert.ToString((char)38));
    }


The character code (38) is an ASCII number, though, and the Scandinavian characters do not exist in ASCII. Would something like this work?

Code:

    public string escapeSearchTerms(string data)
    {
        // Escape "&" first so the ampersands in the entities added below
        // are not escaped a second time.
        return data.Replace(Convert.ToString((char)38), "&amp;")
                   .Replace("ä", "&auml;")
                   .Replace("å", "&aring;")
                   .Replace("ö", "&ouml;");
    }

    public string unescapeSearchTerms(string data)
    {
        // Reverse order: decode the letter entities first and "&amp;" last.
        return data.Replace("&auml;", "ä")
                   .Replace("&aring;", "å")
                   .Replace("&ouml;", "ö")
                   .Replace("&amp;", Convert.ToString((char)38));
    }


If I understand the structure of the code correctly, this would entity-encode the characters before they are used in the search? This means that if an "ä" is entity-encoded in the actual page content you would get a match. But if it's not encoded (as in the case with the page headlines above) you wouldn't get a match.
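
One way around that either/or might be to decode both sides before comparing, instead of encoding only the search term. The following is only a sketch under my own assumptions and not how XSLT Search actually works: SearchNormalizer, Normalize and Matches are hypothetical names, and it uses System.Net.WebUtility.HtmlDecode from newer .NET versions (on older ASP.NET, System.Web.HttpUtility.HtmlDecode should play the same role).

```csharp
using System;
using System.Net;

public static class SearchNormalizer
{
    // Decode HTML entities (&auml; -> ä, &amp; -> &) so that entity-encoded
    // text and plain text end up in the same form.
    public static string Normalize(string text)
    {
        return WebUtility.HtmlDecode(text ?? string.Empty);
    }

    // Case-insensitive substring match on the decoded forms of both strings.
    public static bool Matches(string content, string searchTerm)
    {
        string term = Normalize(searchTerm);
        if (term.Length == 0)
        {
            return false; // an empty term should not match everything
        }
        return Normalize(content).IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Encoded body text and a plain headline both match the plain term.
        Console.WriteLine(Matches("konsulttj&auml;nster", "konsulttjänster")); // True
        Console.WriteLine(Matches("Konsulttjänster", "konsulttjänster"));      // True
    }
}
```

With this approach it would no longer matter whether the editor stored the character as "ä" or as &auml;, at the cost of decoding the content for every comparison.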



kalpa
Posted: Friday, November 23, 2007 10:16:30 AM

You're on the right track there...

Further up in this thread, Mr ImageGen'n'XSLT Search pointed me to this post (http://forum.umbraco.org/17461) about XSLT Search and Danish characters. It covers just what you've tried to accomplish here...

// ; ) Kalle