irc.hashbang.sh geoDNS setup is unreliable #70

Open
2 tasks
KellerFuchs opened this issue Jul 3, 2017 · 21 comments

@KellerFuchs
Member

KellerFuchs commented Jul 3, 2017

We currently have an outage where lon1.irc.hashbang.sh fails all TLS handshakes.
All users in Europe are only sent a record for lon1.

  • The health check should actually perform a TLS handshake and validate the cert, if AWS can even do that.
  • We should, longer-term, make sure that all users get at least 2 IRCds in any request.
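
A handshake-level check along the lines of the first bullet could look roughly like the following Python sketch. It is not what Route53 runs; the port (6697, the conventional IRC-over-TLS port), the function name `tls_health_check`, and the `server_name` parameter are assumptions made for illustration.

```python
import socket
import ssl


def tls_health_check(host, port=6697, server_name=None, timeout=5.0):
    """Return True only if a TLS handshake succeeds and the cert validates.

    `server_name` is the name the certificate must be valid for; it
    defaults to `host`. All names here are illustrative assumptions.
    """
    # create_default_context() turns on certificate verification and
    # hostname checking against the system trust store.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake runs inside wrap_socket(); an expired, untrusted,
            # or mismatched certificate raises an ssl.SSLError subclass.
            with context.wrap_socket(sock, server_hostname=server_name or host):
                return True
    except OSError:
        # ssl.SSLError, timeouts, refused connections, and DNS failures
        # are all OSError subclasses.
        return False
```

For the outage described above, a call such as `tls_health_check("lon1.irc.hashbang.sh", server_name="irc.hashbang.sh")` would fail as soon as the handshake breaks, even though a plain TCP connect still succeeds.
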
@KellerFuchs
Member Author

PS: That would be way easier if IRC had SRV support, but if wishes were fishes, ...

@mayli

mayli commented Aug 16, 2017

Could we respond with both irc servers sorted by geoDNS? Dunno if it's available or not.

@KellerFuchs
Member Author

@mayli Not sure what you mean by “sorted” in that case.

@necrophcodr
Contributor

necrophcodr commented Aug 18, 2017

You don't "sort" anything in DNS. You just respond with what is needed. With DNS it's easy, because the geo part is built in. DNS servers in the EU only respond with European servers, US DNS servers only respond with US servers, simple as that. I'm sure Route53 allows for this?

What I mean is, why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations. This way, a server from EU will respond faster in EU than a US server will in EU, hence only the EU entries are used by people located there.

@RyanSquared
Member

RyanSquared commented Aug 18, 2017

> What I mean is, why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations. This way, a server from EU will respond faster in EU than a US server will in EU, hence only the EU entries are used by people located there.

That is what we currently do; however, we've run into a few issues because of this, as was pointed out above. If a TLS certificate is invalid, the server must be removed from the DNS replies. If the server isn't actually alive, it must also be removed from the DNS replies.

@mayli

mayli commented Aug 18, 2017

@KellerFuchs By "sorted" I mean that the entries in a DNS response come in an ordered "answer" section, e.g.
[image: order]
In most cases DNS servers are implemented to return them in arbitrary order, to get some kind of DNS-level load balancing.

On the client side, it will usually try to connect to each entry in the order given in the response.
With those two combined, we could have DNS-level HA: the faster server is the primary and the slower server is the backup.
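
The client-side behaviour described here can be sketched in Python as follows. This is only an illustration of the "try each entry in turn" idea, not what any particular IRC client does; the function name `connect_with_fallback` is hypothetical, and the order of addresses is whatever the local resolver hands back.

```python
import socket


def connect_with_fallback(host, port, timeout=5.0):
    """Try every address returned for `host`, in the order received."""
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock  # first address that accepts the connection wins
        except OSError as exc:
            # Remember the failure and fall through to the next address.
            last_error = exc
            sock.close()
    raise ConnectionError(
        f"no address for {host!r} accepted a connection"
    ) from last_error
```

The standard library's `socket.create_connection()` already loops over addresses in much the same way. Note that this only falls back when the TCP connect fails; a server that accepts connections but breaks the TLS handshake would still be picked.
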

@KellerFuchs
Member Author

> why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations

That was exactly what was in place.
The issue was that the healthcheck, which was there to avoid sending users to a broken server, only checked that a TCP connection could be established; of course, when TLS broke, irc.hashbang.sh was suddenly broken for all of Europe...

@KellerFuchs
Member Author

@mayli Except that resource records are not ordered, or rather, quoting RFC 1034, 3.6, “the order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS”. In practice, many DNS resolvers randomize the order in an RRset, to prevent broken clients (cough Windows cough) from always hitting the “first” server.

The correct way to implement that would be SRV records (RFC 2782), but of course that's not a thing for IRC...

@RyanSquared
Member

RyanSquared commented Sep 10, 2017

14:27 <LordRyan> hey Habbie you work with PowerDNS right?
14:27 <Habbie> i do
14:27 <LordRyan> damn that was fast
14:28 <LordRyan> if I wanted to have a GeoIP-based domain with live health checks, what would be the best way to do that?
14:28 <LordRyan> Could I pack in cqueues and use cqueues in a checking mechanism?
14:29 <Habbie> in the auth luabackend you mean?
14:29 <LordRyan> well i'm honestly not sure how lua integrates into it, but i'd assume so yes
14:29 <Habbie> assuming this is auth, your options are
14:29 <Habbie> - luabackend
14:29 <Habbie> - pipebackend
14:29 <Habbie> - remotebackend
14:30 <Habbie> luabackend has actual Lua states inside powerdns, and absolutely nothing happens in them except when a query comes in
14:30 <Habbie> which is not where you want to do your health checks because somebody is waiting for an answer
14:30 <Habbie> pipebackend and remotebackend integrate over pipes/sockets using either a simple line-based protocol or JSON (inside HTTP depending on choices you make)
14:30 <Habbie> in which case your end can do whatever the hell it wants as long as it responds over the socket
14:30 <LordRyan> hm. alrighty.
14:31 <LordRyan> so PowerDNS kinda acts like a frontend and then I can use a backend to form a response in the form of a Lua server?
14:31 <Habbie> yes
14:31 <Habbie> and you have to follow a few very simple rules
14:31 <Habbie> and powerdns will get all the DNS pain exactly right for you
14:31 <LordRyan> I can do a cqueues async loop where the healthcheck runs every minute and still be able to send data across the socket
14:31 <LordRyan> awesome 👍
14:31 <Habbie> yes, that sounds good
14:32 <LordRyan> so is it possible to set up this backend for just one subdomain, or would it apply for all to go to this backend?
14:32 <Habbie> the best short answer is 'run a separate pdns_server for this and put dnsdist in front to route queries'
14:33 <LordRyan> alrighty
14:33 <LordRyan> thank you for your time
14:33 <Habbie> using multiple backends in a single pdns_server is thorny, behaviour tends to subtly change between versions, so we don't recommend it
14:33 <Habbie> no problem
14:33 <Habbie> if you have more questions further down the road, OFTC #powerdns is welcoming and is not just me :)

So, currently the best solution is:

  • Use dnsdist to route queries to one of two servers:
    • (.+\.)?irc.hashbang.sh goes to a PowerDNS server, which forwards to a cqueues-based healthcheck and response server
    • everything else goes to another server serving static responses
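
To make the "healthcheck and response server" half of this more concrete, here is a rough Python sketch of a backend speaking what I understand to be the PowerDNS pipebackend line protocol (ABI version 1: `HELO`, tab-separated `Q` lines, `DATA`/`END` replies); check the PowerDNS documentation before relying on the details. It is written in Python rather than Lua/cqueues purely for illustration. The server names, the RFC 5737 placeholder addresses, the TTL, and the function names are all assumptions, and SOA/NS handling and the geo logic are left out.

```python
import sys
import threading

# Hypothetical data: placeholder names and documentation-range addresses.
SERVERS = {"lon1": "192.0.2.10", "nyc1": "192.0.2.20"}
ZONE = "irc.hashbang.sh"
TTL = 60

# A background checker (not shown) would add and remove names here, so
# that no health check ever runs while a DNS query is waiting.
healthy = set(SERVERS)
healthy_lock = threading.Lock()


def answer(line):
    """Turn one pipebackend request line into a list of reply lines."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "HELO":
        return ["OK\thealth-aware backend sketch"]
    if fields[0] != "Q" or len(fields) < 6:
        return ["FAIL"]
    _, qname, qclass, qtype, qid, _remote = fields[:6]
    replies = []
    if qname.lower().rstrip(".") == ZONE and qtype in ("A", "ANY"):
        with healthy_lock:
            live = sorted(healthy)
        # Only servers currently marked healthy are handed out.
        for name in live:
            replies.append(
                "\t".join(["DATA", qname, qclass, "A", str(TTL), qid, SERVERS[name]])
            )
    replies.append("END")
    return replies


def serve(stream_in=sys.stdin, stream_out=sys.stdout):
    """Read requests from PowerDNS line by line and write the replies."""
    for line in stream_in:
        for reply in answer(line):
            print(reply, file=stream_out, flush=True)
```

Calling `serve()` from a script launched by PowerDNS would run the loop; keeping `answer()` a pure function of the current `healthy` set is what lets the checks run on their own schedule.
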

@RyanSquared
Member

@KellerFuchs
Member Author

KellerFuchs commented Sep 10, 2017

FYI, using PowerDNS for GeoDNS means that we point everyone at our own DNS server, which isn't great for latency or reliability.

OTOH, AWS supports SSL healthchecks, which ought to be enough.

@RyanSquared
Member

> OTOH, AWS supports SSL healthchecks, which ought to be enough.

But it also means relying on AWS. I figured we were hopefully going for something more "independent"? Testing such things on a local system would be harder without a builtin DNS setup.

@KellerFuchs
Member Author

Never mind, it seems AWS supports SSL-based healthchecks for ELB but not for Route53.
What the actual fuck. :O

@KellerFuchs
Member Author

KellerFuchs commented Sep 11, 2017

@RyanSquared In principle, I would love us to run our own DNS infra.
However, for reliability & latency reasons, that basically means relying on 3rd-party services for replicas (I don't happen to have an anycast DNS network in my back pocket... yet :P), and the standard ways of doing that don't support GeoDNS (because that's not something standardized).

As far as I can tell, we can pick 2 out of 3 from:

  • selfhosted (at least for the primary replica & zone signing);
  • highly-available and low-latency;
  • GeoDNS support.

Frankly, I would be quite OK dropping GeoDNS in favor of the first two, esp. given how limited Route53's builtin healthchecks are, but that definitely would be a longer-term project.
Also, it would need to be discussed with the other admins, and I don't feel that's a discussion that belongs in this issue.

@mayli

mayli commented Sep 12, 2017

@KellerFuchs How does Freenode solve this problem?

@KellerFuchs
Member Author

By not doing GeoDNS.

@RyanSquared
Member

@mayli Freenode, Esper, and many other networks just have a set of records that point to all their servers, independent of location. Users who have an issue are advised to point their client at a specific server (or to select one from a list of servers) that works best for them.

Not all servers might be listed (at the same time, or even in general) on the public interface, though. However, for our setup, it should be fine to just list them all. Plus, nothing against Freenode, but until recently their network management has been a bit clunky.

@mayli

mayli commented Sep 14, 2017

So, can we return all records as well? This seems like the simple & stupid solution that works without too much effort. And we'd better spend our bandwidth on more important stuff, like userdb and other things.

@RyanSquared
Member

Yes. That is the "default" way most DNS servers return multiple results for one name.

@KellerFuchs
Member Author

Yes, endless discussion about a thing that is currently a non-issue is indeed consuming bandwidth...

@RyanSquared
Member

In order to close this issue - is the DNS setup in general still an issue? If we add a server are we going to have GeoIP enabled for it? If so, how should we remove this configuration?

@DeviaVir DeviaVir removed their assignment Oct 7, 2019
6 participants