Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong charset used in creating search terms when server supports UTF8 #131

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

jmehrens
Copy link
Contributor

#474
Signed-off-by: jmehrens [email protected]

@jmehrens jmehrens added the bug Something isn't working label Feb 15, 2024
@jmehrens jmehrens requested a review from lukasj February 15, 2024 22:12
@jmehrens jmehrens self-assigned this Feb 15, 2024
@jmehrens jmehrens marked this pull request as draft February 15, 2024 22:12
@jmehrens
Copy link
Contributor Author

jmehrens commented Feb 15, 2024

@lukasj @jbescos I'm trying to take a look at these high value tickets and get them fixed but I need a input on this one before I proceed.

I updated test where I print the command string followed by the hex values of each char for some validation of this patch. Test code is in the PR.

If I run the old code with the new test I see the following output:

B3 SEARCH SUBJECT "���"
0x42, 0x33, 0x20, 0x53, 0x45, 0x41, 0x52, 0x43, 0x48, 0x20, 0x53, 0x55, 0x42, 0x4a, 0x45, 0x43, 0x54, 0x20, 0x22, 0x19, 0xfffd, 0xfffd, 0x22, 
jakarta.mail.MessagingException: B3 BAD UTF-8 encoding/decoding not used;

Looks like a problem where UTF-8 was not used to encode/decode the bytes. Ok, fair enough.

With the full patch proposed, which is very similar to jakartaee/mail-api#607 and #104

The output is:

B3 SEARCH SUBJECT "’系统"
0x42, 0x33, 0x20, 0x53, 0x45, 0x41, 0x52, 0x43, 0x48, 0x20, 0x53, 0x55, 0x42, 0x4a, 0x45, 0x43, 0x54, 0x20, 0x22, 0x2019, 0x7cfb, 0x7edf, 0x22, 

Looks much better in that we are not seeing garbage characters. Great!

However, RFC6855 states:

The IMAP base specification [RFC3501] forbids the use of 8-bit
characters in atoms or quoted-strings. Thus, a UTF-8 string can only
be sent as a literal.

RFC3501 Section 4.3. String, states that a literal is:

A literal is a sequence of zero or more octets (including CR and
LF), prefix-quoted with an octet count in the form of an open
brace ("{"), the number of octets, close brace ("}"), and CRLF.

[snip]

A quoted string is a sequence of zero or more 7-bit characters,
excluding CR and LF, with double quote (<">) characters at each
end.

If I'm understand the RFC correctly matched with the output above, this current patch is generating a 8-bit characters in a quoted-string which is incorrect. Therefore will simply produce a similar issue to jakartaee/mail-api#526. Yep, that is 3 tickets I found so far, not including this one, surrounding the same core issue.

If I'm on track then I think what I need to do is (fix my branch name) have the SearchSequence class query this.protocol.suportsUtf8() method and generate literals instead of quoted strings when needed. Otherwise, use old behavior.

Is my analysis on track here or am I off? Does this look like a reasonable approach?

@jmehrens jmehrens requested a review from jbescos February 15, 2024 22:40
@jbescos
Copy link
Member

jbescos commented Feb 20, 2024

@lukasj @jbescos I'm trying to take a look at these high value tickets and get them fixed but I need a input on this one before I proceed.

I updated test where I print the command string followed by the hex values of each char for some validation of this patch. Test code is in the PR.

If I run the old code with the new test I see the following output:

B3 SEARCH SUBJECT "���"
0x42, 0x33, 0x20, 0x53, 0x45, 0x41, 0x52, 0x43, 0x48, 0x20, 0x53, 0x55, 0x42, 0x4a, 0x45, 0x43, 0x54, 0x20, 0x22, 0x19, 0xfffd, 0xfffd, 0x22, 
jakarta.mail.MessagingException: B3 BAD UTF-8 encoding/decoding not used;

Looks like a problem where UTF-8 was not used to encode/decode the bytes. Ok, fair enough.

With the full patch proposed, which is very similar to jakartaee/mail-api#607 and #104

The output is:

B3 SEARCH SUBJECT "’系统"
0x42, 0x33, 0x20, 0x53, 0x45, 0x41, 0x52, 0x43, 0x48, 0x20, 0x53, 0x55, 0x42, 0x4a, 0x45, 0x43, 0x54, 0x20, 0x22, 0x2019, 0x7cfb, 0x7edf, 0x22, 

Looks much better in that we are not seeing garbage characters. Great!

However, RFC6855 states:

The IMAP base specification [RFC3501] forbids the use of 8-bit
characters in atoms or quoted-strings. Thus, a UTF-8 string can only
be sent as a literal.

RFC3501 Section 4.3. String, states that a literal is:

A literal is a sequence of zero or more octets (including CR and
LF), prefix-quoted with an octet count in the form of an open
brace ("{"), the number of octets, close brace ("}"), and CRLF.

[snip]

A quoted string is a sequence of zero or more 7-bit characters,
excluding CR and LF, with double quote (<">) characters at each
end.

If I'm understand the RFC correctly matched with the output above, this current patch is generating a 8-bit characters in a quoted-string which is incorrect. Therefore will simply produce a similar issue to jakartaee/mail-api#526. Yep, that is 3 tickets I found so far, not including this one, surrounding the same core issue.

If I'm on track then I think what I need to do is (fix my branch name) have the SearchSequence class query this.protocol.suportsUtf8() method and generate literals instead of quoted strings when needed. Otherwise, use old behavior.

Is my analysis on track here or am I off? Does this look like a reasonable approach?

I also understand it in the same way as you do, but it is probably a good idea to test it with an IMAP server to make sure it works. In case you need a local IMAP server you can use this one.

#474
Signed-off-by: jmehrens [email protected]
@jmehrens
Copy link
Contributor Author

I also understand it in the same way as you do, but it is probably a good idea to test it with an IMAP server to make sure it works. In case you need a local IMAP server you can use this one.

Awesome! Thanks! This patch will take some time to get working correctly with regards to coding and testing

@jmehrens
Copy link
Contributor Author

jmehrens commented Mar 4, 2024

@synim503 I pushed my last changes which are just an incomplete set of changes.

Here is where this is at so you are aware of what you'll run into:

  1. The format and sizing of the org.eclipse.angus.mail.imap.protocol.SearchSequence.Utf8Literal is not tested nor can I vouch for correctness. Only run this in a test environment. This was a work in progress.
  2. Still working on correct approach for passing of the charset used to encode the bytes. When server supportsUtf8 I would the client should be set to UTF-8 but, I'm still working on what to do if the client is a subset of UTF-8. This is TBD so set the client to UTF-8 for testing.
  3. This change breaks the search unit test because the line parsing is only setup to parse quoted text and not literals. The test is commented out so the build works.

@synim503
Copy link

synim503 commented Mar 7, 2024

@jmehrens, I don't have a deep knowledge of programming. But according to a preliminary test this method works well for Russian. It finds letters well both by subject and body. The only thing that I have noticed so far is that the search problems occur with the ' symbol

@jmehrens
Copy link
Contributor Author

jmehrens commented Mar 7, 2024

@synim503 Thank you for trying out that build. When you tested with ', was it at the start or end of search text? Would you be able to attach session debug logs if your data is not sensitive?

Signed-off-by: jmehrens [email protected]
@jmehrens
Copy link
Contributor Author

jmehrens commented Apr 2, 2024

Testing with imap-mail.outlook.com the server doesn't advertise UTF8=ACCEPT

Apr 02, 2024 1:03:20 AM org.eclipse.angus.mail.imap.IMAPStore protocolConnect
FINE: trying to connect to host "imap-mail.outlook.com", port 993, isSSL true
Apr 02, 2024 1:03:23 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: * OK The Microsoft Exchange IMAP4 service is ready. [snip]
Apr 02, 2024 1:03:23 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A0 CAPABILITY
Apr 02, 2024 1:03:24 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: * CAPABILITY IMAP4 IMAP4rev1 AUTH=PLAIN AUTH=XOAUTH2 SASL-IR UIDPLUS MOVE ID UNSELECT CHILDREN IDLE NAMESPACE LITERAL+
Apr 02, 2024 1:03:24 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A0 OK CAPABILITY completed.

With -Dmail.imap.throwsearchexception=false the search works by matching on the client side without any issues other than it is slow and doesn't test this change.

I did another test were I've set -Dmail.imap.throwsearchexception=true -Dmail.mime.charset=US-ASCII and modified the search charsets to include US-ASCII as the last search option:

Apr 02, 2024 1:03:25 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A4 SEARCH CHARSET UTF-8 SUBJECT {13+}
Apr 02, 2024 1:03:25 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: Comunicação ALL
Apr 02, 2024 1:03:25 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A4 NO [BADCHARSET (US-ASCII)] The specified charset is not supported.
Apr 02, 2024 1:03:25 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A5 SEARCH CHARSET ISO-8859-1 SUBJECT {11+}
Apr 02, 2024 1:03:25 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: Comunica��o ALL
Apr 02, 2024 1:03:26 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A5 NO [BADCHARSET (US-ASCII)] The specified charset is not supported.
Apr 02, 2024 1:03:26 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A6 SEARCH CHARSET US-ASCII SUBJECT {11+}
Apr 02, 2024 1:03:26 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: Comunica��o ALL
Apr 02, 2024 1:03:26 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: * SEARCH
Apr 02, 2024 1:03:26 AM org.eclipse.angus.mail.util.LogOutputStream log
FINEST: A6 OK SEARCH completed.

As shown in the output JakartaMail converts US-ASCII to ISO-8859-1 but outlook365 expects US-ASCII. So there is a new bug ticket here were we are unable to force US-ASCII as a search char set to support outlook365.

There is a possible enhancement where we could test for mostly ascii and combine imap search on an ascii prefix/suffix then client side filter on the full utf-8 text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants