Truncated text during Xapian indexing

Discussion:

Sebastian Hagedorn

2018-02-15 09:44:03 UTC

Hi,

as a follow-up to a discussion on IRC, I noticed the following diagnostic
log line while replicating mailboxes to a new server with Xapian:

"Xapian: truncating text from message mailbox …"

Nicola replied that to her knowledge only the first 4 MB of a message are
indexed, which led to these comments (which I hope it is OK to copy from
the IRC channel):

^Simon^: Is that the first 4Mb of the text/html and/or text/plain parts, or
first 4Mb of the entire message body, ignoring any mime parts?
[2:34pm]
onlight: hagedose: ^Simon^: That's a darn good question. This needs to be
wrapped into the documentation. Thanks to all for this thread!
[4:19pm]
^Simon^: If it's just the first 4Mb of the body you could run into issues
where the body is encoded in some way (eg quoted-printable)
nicola_fm: For a faster response, drop some queries about cyrus and xapian
on the mailing list. I am a poor proxy for sending messages to Robert S!

As suggested by Nicola, I am taking it to the list :-)

--
.:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
.:.Universität zu Köln / Cologne University - ☎ +49-221-470-89578.:.

Robert Stepanek

2018-02-15 10:20:32 UTC

Post by Sebastian Hagedorn
^Simon^: Is that the first 4Mb of the text/html and/or text/plain parts, or
first 4Mb of the entire message body, ignoring any mime parts?

This limit defines the maximum byte length per MIME body part of type "text". The byte length is calculated after decoding (e.g. quoted-printable), conversion to UTF-8 and search text normalisation (e.g. stripping HTML tags, replacing umlaut characters with their ASCII counterparts, etc.). It also applies to every other search-indexed field, such as subjects and headers, but in practice it is only relevant for mail bodies.
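To illustrate the order of operations described above, here is a minimal Python sketch. It is not the Cyrus implementation: the 4 MB value is taken from the IRC discussion, and the names MAX_PART_BYTES, normalise and indexable_parts are hypothetical. The normalisation is a crude stand-in (it drops anything that looks like an HTML tag and removes combining marks). The point is only that the limit is applied per text part, after transfer decoding, charset conversion and normalisation.

```python
import re
import unicodedata
from email import message_from_bytes

# Assumed value: the thread only mentions "the first 4 MB".
MAX_PART_BYTES = 4 * 1024 * 1024


def normalise(text):
    """Crude stand-in for search normalisation: drop HTML tags and diacritics."""
    text = re.sub(r"<[^>]+>", " ", text)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def indexable_parts(msg, limit=MAX_PART_BYTES):
    """Return (utf8_bytes, was_truncated) for every text/* part of msg."""
    result = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable or base64 transfer encoding.
        raw = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = raw.decode(charset, errors="replace")
        except LookupError:  # unknown charset label
            text = raw.decode("utf-8", errors="replace")
        data = normalise(text).encode("utf-8")
        # The length check happens last, on the normalised UTF-8 bytes.
        result.append((data[:limit], len(data) > limit))
    return result
```

Calling indexable_parts(message_from_bytes(raw_message)) on a quoted-printable message therefore measures the decoded, normalised text and not the encoded bytes on disk. Note that a plain byte slice can cut a multi-byte UTF-8 sequence in half; a real indexer would need to handle that.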

Post by Sebastian Hagedorn
nicola_fm: For a faster response, drop some queries about cyrus and xapian
on the mailing list. I am a poor proxy for sending messages to Robert S!
As suggested by Nicola, I am taking it to the list :-)

Good idea :)

Cheers,
Robert

Sebastian Hagedorn

2018-02-15 12:08:27 UTC

Thanks. I suppose in practice that is good enough™

While we're at it, maybe you can answer some other questions regarding
Xapian?

Is the setting "search_skipdiacrit" in imapd.conf honored during the
indexing or is that only relevant while searching? Given your comment
regarding search normalization above I take it Umlaut characters are not
considered diacriticals? It's not a huge issue, but as a German university
it would be nice for our users if a search could distinguish between
"hatte" and "hätte", as an example.

Just out of curiosity, how is the mapping between a Xapian docid and a
message file on disk achieved? I played around with xapian-delve and the
Perl example simplesearch.pl. When I search a term, I get a list of
docid's, but how do I know which message that is?

Cheers
Sebastian

Robert Stepanek

2018-02-15 15:12:23 UTC

Post by Sebastian Hagedorn
Is the setting "search_skipdiacrit" in imapd.conf honored during the
indexing or is that only relevant while searching? Given your comment
regarding search normalization above I take it Umlaut characters are not
considered diacriticals? It's not a huge issue, but as a German university
it would be nice for our users if a search could distinguish between
"hatte" and "hätte", as an example.

Cyrus does consider umlaut characters diacriticals (I glossed over that in my previous comment, which described the behaviour under the default settings). The search_skipdiacrit setting applies to both indexing and search.

As an example, let's append two emails to a mailbox. The body of message 1 contains the German verb "gären". Message 2 contains the verb "garen" (for the non-German speakers: these verbs mean two different things).

With search_skipdiacrit set to true (the default), this is what lands in the Xapian database:

[...] Zgaren garen

and searches for "garen" and "gären" will both match both messages.

With search_skipdiacrit set to false, however, we get

[...] Zgaren Zgären garen gären

and searches for "garen" and "gären" will only match the respective messages.
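The effect of the two settings can be imitated with a small Python sketch. This is an illustration only, not what Cyrus or Xapian actually run: fold_diacritics, index_terms and matches are hypothetical helpers, stemming (the Z-prefixed terms) is left out, and folding is done by removing Unicode combining marks.

```python
import unicodedata


def fold_diacritics(word):
    """Remove combining marks, so that 'gären' becomes 'garen'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def index_terms(words, skip_diacritics=True):
    """Return the set of terms that would be stored for the given words."""
    terms = set()
    for word in words:
        term = unicodedata.normalize("NFC", word.lower())
        if skip_diacritics:
            term = fold_diacritics(term)
        terms.add(term)
    return terms


def matches(query, document_words, skip_diacritics=True):
    """Apply the same folding to the query and the document, then compare."""
    query_terms = index_terms([query], skip_diacritics)
    return bool(query_terms & index_terms(document_words, skip_diacritics))
```

With skip_diacritics=True, a query for "garen" matches a document that contains only "gären"; with skip_diacritics=False it does not. Because the same folding is applied on both sides, changing the setting on an existing installation presumably means the old index no longer agrees with new queries until it is rebuilt.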

I uploaded a new test to Cassandane that demonstrates this [1] (the subject_isutf8 test case might also be of interest). I'd only deactivate search_skipdiacrit if you are sure that your users will benefit from it. If in doubt, I would rather err on the safe side and return false positives by skipping diacritics (the default).

There's more to say about the Z prefixes: Cyrus currently uses the English stemmer for all text, so for non-English text the stem terms typically equal their non-stemmed input. While this might seem odd, it's the best we can do without proper language detection for both indexing and search. I implemented multi-language stem support in an experimental feature branch, but haven't yet resolved the issues around fingerprinting search queries. There's an open issue to track this [2].

[1] https://github.com/cyrusimap/cassandane/blob/master/Cassandane/Cyrus/SearchFuzzy.pm#L403
[2] https://github.com/cyrusimap/cyrus-imapd/issues/72

Post by Sebastian Hagedorn
Just out of curiosity, how is the mapping between a Xapian docid and a
message file on disk achieved? I played around with xapian-delve and the
Perl example simplesearch.pl. When I search a term, I get a list of
docid's, but how do I know which message that is?

In 3.x, Cyrus search stores an internal unique message id, called guid, as docid in Xapian. The guid is currently a SHA-1 hash of the raw message, which allows deduplication and avoids re-indexing messages that have already been seen. The conversations.db of a user maps this guid to a list of mailbox:UID pairs.
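As a rough illustration of why a content hash helps here, the following Python sketch computes a SHA-1 digest of raw message bytes and uses it to skip messages that were already indexed. It assumes the guid is simply the hex digest of the message bytes as stored; message_guid and needs_indexing are hypothetical names, not Cyrus functions.

```python
import hashlib


def message_guid(raw_message):
    """Hex SHA-1 digest of the raw message bytes (assumed guid format)."""
    return hashlib.sha1(raw_message).hexdigest()


def needs_indexing(raw_message, seen_guids):
    """Return True the first time a message body is seen, False afterwards."""
    guid = message_guid(raw_message)
    if guid in seen_guids:
        return False  # same content, e.g. a copy in another mailbox
    seen_guids.add(guid)
    return True
```

Two copies of the same message in different mailboxes hash to the same value, so only the first one would be indexed in this sketch. Mapping the guid back to mailbox:UID pairs still requires a separate lookup table.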

Off the top of my head, there currently isn't an "official" way in Cyrus to retrieve the mailbox:UID list for a given guid from outside the Cyrus process. Depending on your use case, you could:

1.) build your own mapper on top of imap/conversations.h,
2.) use cvt_cyrusdb to dump the contents of a conversations.db into plain text, or
3.) use the JMAP layer to fetch the JMAP-formatted message or the raw message blob by id. For JMAP email, prefix the guid with 'M' and use it in an Email/get method. For blobs, use 'G' as the prefix.

Both id prefixes are "unofficial": we might change the JMAP id scheme in future releases. But I guess this isn't going to happen any time soon, if ever.
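For the JMAP route (option 3), a request body could be assembled roughly as in the Python sketch below. Several things here are assumptions rather than facts from this thread: the capability URNs and the request layout follow the later JMAP RFCs and may differ from what a 2018 Cyrus build accepts, account_id is a placeholder, and whether the full guid or a shortened form is expected after the prefix may depend on the Cyrus version. The helper names are hypothetical.

```python
import json


def jmap_email_id(guid):
    """Unofficial scheme from this thread: 'M' prefix for emails."""
    return "M" + guid


def jmap_blob_id(guid):
    """Unofficial scheme from this thread: 'G' prefix for raw message blobs."""
    return "G" + guid


def email_get_request(guid, account_id):
    """Build the JSON body of an Email/get call for a single guid."""
    request = {
        # Capability URNs as in the published JMAP specs (an assumption here).
        "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
        "methodCalls": [
            ["Email/get", {"accountId": account_id, "ids": [jmap_email_id(guid)]}, "c0"]
        ],
    }
    return json.dumps(request)
```

The resulting string would be POSTed to the server's JMAP API endpoint with suitable authentication; the endpoint URL is deployment-specific and is not shown here.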

Hope it helps,
Robert

Sebastian Hagedorn

2018-02-20 10:08:42 UTC

Thanks for your reply, that was very interesting and helpful!

--On 15 February 2018 at 16:12:23 +0100, Robert Stepanek wrote:

Post by Robert Stepanek
2.) use cvt_cyrusdb to dump the contents of a conversations.db into plain text

FWIW, that conversion is so "lossy" as to be useless. But it was really
only curiosity, so it doesn't matter.

Cheers,
Sebastian
