Missing timeout for the CAPABILITY command #278

denzelem · 2024-05-03T12:57:25Z

Version: net-imap (0.4.8)

Ruby: 3.2.2

Description

I'm using net-imap for fetching emails in some background jobs. There are rare cases where the email provider Microsoft Exchange returns no or an unexpected answer for the CAPABILITY command. This blocks my background jobs for a long time, sometimes more than 30 minutes. At some point of time a Errno::ECONNRESET (Connection reset by peer) exception is thrown, I assume when Microsoft Exchange closes the connection.

Backtrace

/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:108:in `sleep'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:108:in `wait'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:108:in `wait_for_cond'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:108:in `wait'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:2701:in `get_tagged_response'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:2794:in `block in send_command'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:2776:in `send_command'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:1077:in `block in capability'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
/home/foo/.rbenv/versions/3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:1076:in `capability'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:987:in `capabilities'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:972:in `capable?'
/var/www/foo/shared/bundle/ruby/3.2.0/gems/net-imap-0.4.8/lib/net/imap.rb:1310:in `authenticate'

Feature request

Does it make sense, that one can configure a global timeout or a timeout for individual actions (e.g. https://github.com/ruby/net-imap/blob/v0.4.8/lib/net/imap.rb#L2689) within this gem? Or do you recommend e.g. wrapping gem code into own Timeout blocks?

Example

require 'ostruct'
require 'net/imap'
require 'socket'
require 'active_support/all'

class MailServer
  CRLF = "\r\n".freeze

  def initialize(port)
    @server = TCPServer.new(port)
    puts "Mail server started on port #{port}"
  end

  def start
    loop do
      Thread.start(@server.accept) do |client|
        puts "Connection established with #{client.peeraddr[2]}"

        client.puts "* OK The Microsoft Exchange IMAP4 service is ready.#{CRLF}"

        loop do
          request = client.gets&.chomp
          break if request.blank?

          case request
          when 'RUBY0001 CAPABILITY'
            client.puts "* CAPABILITY IMAP4 IMAP4rev1 AUTH=PLAIN AUTH=XOAUTH2 SASL-IR UIDPLUS ID UNSELECT CHILDREN IDLE NAMESPACE LITERAL+#{CRLF}"
            client.puts "RUBY0001 OK CAPABILITY completed.#{CRLF}"
          when 'RUBY0002 AUTHENTICATE XOAUTH2 dXNlcj11c2VyAWF1dGg9QmVhcmVyIHRlc3QxMjMBAQ=='
            client.puts "RUBY0002 OK AUTHENTICATE completed.#{CRLF}"
          else
            client.puts "* ERR Unknown command #{request}" + CRLF
          end
        end

        client.close
        puts "Connection closed with #{client.peeraddr[2]}"
      end
    end
  end

  def stop
    @server&.close
    puts 'Mail server stopped'
  end
end

server = MailServer.new(2000)
Thread.new { server.start }
sleep 1

Net::IMAP.new('localhost', port: 2000).authenticate('XOAUTH2', 'user', 'test123')

Dropping the line client.puts "RUBY0001 OK CAPABILITY completed.#{CRLF}" will block the script for unlimited time.

The text was updated successfully, but these errors were encountered:

nevans · 2024-05-03T17:59:03Z

Not responding to CAPABILITY, one of the simplest and most important of all IMAP commands, something that should require no more than scanning a configuration setting... Great job, Microsoft! I'd bet the server is misconfigured and/or needs an upgrade. Oh well, you'll see all sorts of weirdness when you work with enough different servers.

Yes, it makes sense to add global read timeouts, but we can't just set a read_timeout, use that on every read, and call it a day like most other simple request/response gems. IMAP connections tend to be long lived, and it's common for some responses to take much longer than others and for some responses (IDLE) to just sit on a connection, waiting to read data, for thirty minutes or so. I would like to introduce optional command timeouts, but we need to be careful about it. I doubt that it's safe to add a simple timeout to get_tagged_response.
Until then, adding Timeout is a reasonable approach. I'd personally consider adding a longer timeout for the entire job (e.g. maybe 15 seconds or a few minutes, depending on the job) and a localized timeout for just for authentication. It generally doesn't take more than a couple of seconds to login, but Microsoft Exchange (and Microsoft 365) can behave pretty weirdly sometimes... still, if it takes more than 10 seconds or so, something's wrong.
For this particular situation though, there is a simpler solution: you can send the sasl_ir: false keyword argument to Net::IMAP#authenticate. SASL-IR is only used when the server explicitly reports the AUTH=mechanism capability, to avoid sending sensitive data that the server can't even use, and that's why we need to ask for the server capabilities. I don't think that's an explicit part of the IMAP SASL-IR spec, but IMO it's implied. And several other protocols do explicitly specify this behavior for their SASL implementation. Using sasl_ir: false will slightly slow down authentication to well-behaved servers because 1) they already sent their CAPABILITY list as part of the server greeting, and 2) it requires waiting the AUTHENTICATE command to do an extra round-trip. But, compared to the slowdown caused by this issue, that's not so bad. And you could special case that just for servers you identify as misbehaving.
If milliseconds matter and you "need" the lowest latency, you could cheat and manually set the @capabilities ivar for Exchange server clients. That's a total hack though, and will probably bite you (or your successor) someday far in the future, after you've long forgotten why you thought it was a good idea. 😉 The capabilities cache ivar will be reset after a successful authentication (capabilities usually change after authentication).

I'm going to leave this ticket open for the first point (above), at least until we have another timeout-related issue or PR to replace it. I would accept a PR to add read timeouts, although I suspect we may need to iterate on the design before it's safe to merge. Since we'll want to iterate on it anyway, a simple naive version could be a good first draft. It might be something to add but leave turned off for 0.5.x and wait until 0.6.x before we enable read timeouts by default.

denzelem · 2024-05-22T09:17:02Z

Thanks @nevans, I can prepare a PoC PR for this issue. I will take approx. 6 weeks until I can start with this task (just for a rough timeline).

nevans · 2024-05-22T12:38:12Z

@denzelem That would be great! I probably won't be working on it any sooner myself. But I should have an updated Net::IMAP.config mechanism in place for v0.5 before then.

wnm · 2024-11-26T10:20:43Z

We have an application that lets users sync emails from different email providers back to our app. Yesterday, office 365 had some server issues, and we had lots of background jobs that were running for 30+ minutes, blocking the whole queue.

While researching a solution, I stumbled upon this issue.

We tracked down where jobs would get stuck with log statements, and realized that the jobs got stuck while authenticating:

# this didn't respond for 30+ seconds:
imap.authenticate(
        'XOAUTH2',
        (imap_endpoint.shared_mailbox_username.presence || imap_endpoint.username),
        imap_endpoint&.microsoft_graph_token&.valid_access_token
      )

Given your second point, we added a specific timeout for the authentication part:

def connect!
  if imap_endpoint.office_365?
    imap.authenticate(
      'XOAUTH2',
      (imap_endpoint.shared_mailbox_username.presence || imap_endpoint.username),
      imap_endpoint&.microsoft_graph_token&.valid_access_token
    )
  else
    imap.login(imap_endpoint.username, imap_endpoint.password)
  end
end


def with_connection(&block)
  Timeout.timeout(LOGIN_TIMEOUT) do
    connect!
  end
  Timeout.timeout(IMAP_TIMEOUT) do
    raise "disconnected" if disconnected?
    yield block
  end
ensure
  disconnect!
end

Our current problem is that those timeouts are not working. The jobs still run for 30+ minutes, being stuck waiting for a response during authentication. Any ideas on how to proceed from here? Thanks in advance for any input!

nevans · 2024-11-26T12:05:05Z

@wnm I don't know why Timeout isn't working there... it should.

You could try setting sasl_ir: false for MS365. You can set that with a kwarg to #authenticate or on the client config.

Now that net-imap has a client config (#288), I'm also open to PRs:

add :when_capabilities_cached option to sasl_ir config. This can behave identically to config.enforce_logindisabled.
authentication read timeout: Since there can be very different expectations per command type, rather than use a single "read_timeout" for all commands it's probably best to add a few different types of read timeout. In this case, I think it's reasonable to use a single "authentication_timeout" for both #login and #authenticate. The implementation is made a bit more complicated by the fact that the socket reads happen on a separate thread.

denzelem · 2024-11-27T08:11:19Z

@nevans I need to admin that I can not make a PR for this issue anymore, sorry. Since we have other issues with the IMAP endpoint of Microsoft Exchange, we are trying to use the Microsoft Graph-API instead. Just a few examples of the issues, if someone else reads this issue:

Net::IMAP::NoResponseError AUTHENTICATE failed. when trying to login too often (we already have some logic to login only n-times per minute, which mitigates the issue for most cases but not all)
Mails are send twice, when we retry on a Net::ReadTimeout with #<TCPSocket:(closed)> (Net::ReadTimeout) error (Microsoft Exchange sends the email, but since our connection breaks, we don't know the actual state)
The endless running logins from this issue

nevans · 2024-11-27T16:50:55Z

@denzelem No prob. Thanks for documenting the issue!

It's good you have some built in rate limiting. In my experience, rate limiting and exponential backoff is eventually a necessity for any communication with remote servers (especially when you retry after failure).

I'm curious about your "mails are send twice". Is this because you are attempting to append to the "Sent" folder after sending via SMTP?

The demographics of my company's customer base has never been enough to warrant adding special-cased API access (except for Google, but GMail's IMAP is mostly well behaved... compared to the other big email Providers). Of course, that's a feedback loop problem: if there aren't enough customers to warrant special treatment, you don't implement the workarounds that would get you more of those customers!

Also, while I do manage some simple background jobs that make quick connections for small tasks, most of my deployed IMAP connections are long-lived and still use EventMachine(!!!). That has some significant pain points of its own, but it does mean that I'm not usually affected much by one connection screwing up any of the others (...except when we've accidentally retried a failed task without rate limiting and backoff and it gets into an infinite loop, thus hogging CPU and spewing GBs of logs 😉).

I long ago just relegated issues like this to the category of "those goofy servers are being goofy again!". 😉 And I have a small list in my head (and sometimes encoded in workarounds) of "yahoo is weird this way, outlook.com is weird that way, MS365 does that wack thing, go daddy hosted IMAP is weird the other way, icloud is crazy for that thing, etc, etc". But my memory is very faulty, and I would like a slightly more methodical classification of which servers are goofy and in what ways are they goofy. I haven't seen a wiki anywhere with these sorts of issues on it... should we start one? (here?) I've been slowly working towards moving our per-server workarounds into a basic strategy-pattern set of classes, but it hasn't been a high priority.

We can certainly do more in net-imap (either documentation or code) to reduce these pain points. Right now, my available time is spent trying to get ESEARCH/UIDONLY/QRESYNC support finalized in time for the ruby 3.4.0 release. But I would like to add comprehensive timeouts in time for v0.6.0. 🤞🏻

nevans mentioned this issue Jun 13, 2024

🔧 Add config option for sasl_ir #294

Merged

nevans added this to the v0.6 milestone Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing timeout for the CAPABILITY command #278

Missing timeout for the CAPABILITY command #278

denzelem commented May 3, 2024

nevans commented May 3, 2024 •

edited

Loading

denzelem commented May 22, 2024

nevans commented May 22, 2024

wnm commented Nov 26, 2024 •

edited

Loading

nevans commented Nov 26, 2024

denzelem commented Nov 27, 2024

nevans commented Nov 27, 2024 •

edited

Loading

Missing timeout for the CAPABILITY command #278

Missing timeout for the CAPABILITY command #278

Comments

denzelem commented May 3, 2024

Description

Backtrace

Feature request

Example

nevans commented May 3, 2024 • edited Loading

denzelem commented May 22, 2024

nevans commented May 22, 2024

wnm commented Nov 26, 2024 • edited Loading

nevans commented Nov 26, 2024

denzelem commented Nov 27, 2024

nevans commented Nov 27, 2024 • edited Loading

nevans commented May 3, 2024 •

edited

Loading

wnm commented Nov 26, 2024 •

edited

Loading

nevans commented Nov 27, 2024 •

edited

Loading