[01:32:21] Could use some help figuring out if something has regressed or intentionally changed (or I'm misremembering)
[01:32:22] https://phabricator.wikimedia.org/T202479
[01:32:42] It seems that garbage hostnames like example.org aren't refused at Varnish; instead they go to app servers.
[01:33:25] and IP-like hostnames as well, without TLS redirect like we normally do for cache_text
[01:34:36] This means http://103.102.166.224/w/load.php?debug=T202479&404_from_app_server and http://103.102.166.224/static/favicon/wikipedia.ico both work as they would with a wikipedia.org hostname, insecurely.
[01:34:37] T202479: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479
[02:47:25] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn)
[03:00:12] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) Hi @tramm i took your request. I see you want to transfer the domain entirely to Wikimedia Eesti. I'm contacting legal because they handle the domain re...
[06:57:15] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Laurentius) >>! In T199252#4575757, @kaldari wrote: > Do we know how many pa...
[13:12:01] Traffic, Operations: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair)
[13:13:09] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) > From the SRE/ops point of view we would like to either completely leave it as it is or completely transfer the domain and delete it from all our confi...
[13:13:57] Traffic, Operations: gdnsd plugin support for ACME DNS challenges - https://phabricator.wikimedia.org/T194965 (Krenair) Status: @bblack has written support into gdnsd in https://github.com/gdnsd/gdnsd/commit/db7fff10b005b951890fa4ff7c843a1e37bbdc58 (as well as a follow up or two) and I've made https://ge...
[13:17:09] https://labs.spotify.com/2018/08/31/smoother-streaming-with-bbr/
[13:19:05] HTTPS, Traffic, Operations: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266 (Krenair) I don't know if we're going to end up doing this in the current letsencrypt puppetisation, but it's mostly there certcentral. Only thing is my puppetisation...
[13:20:20] gilles: I see some VCL in there :)
[13:22:28] HTTPS, Traffic, Operations, Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (Krenair) are we going to do this as part of the letsencrypt puppetisation or is this getting made (mostly?) obsolete by certcentral?
[16:57:31] bblack: ah, I prepped a patch for cp1099 earlier today :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459989/
[16:58:34] you mentioned we're not yet ready for applying cache::canary though?
[16:59:05] ema: well, I mean we're missing the part where we'll have to invent a new IP for UP in the lvs range, and define an LVS service with cp1099 as the only backend, etc
[16:59:20] right
[16:59:24] for today I just wanted to get going on dns stuff
[16:59:45] also we might want to add profile::base::notifications_enabled: '0'
[17:05:47] let's do this instead https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460064/
[17:05:56] it will fix it for all of role(test)
[17:06:12] in the same way that is already used for role(spare::system)
[17:13:10] +1
[17:19:41] mutante: that affects cp1099.eqiad.wmnet,multatuli.wikimedia.org,ruthenium.eqiad.wmnet,tungsten.eqiad.wmnet
[17:19:45] just FYI
[17:19:54] 'O:test' in cumin
[17:21:56] volans: but would you agree that things labeled "test" should not notify ?
[17:22:30] mutante: page ofc, IRC dunno, depends
[17:23:04] relying only on seeing them on the Icinga UI might not be enough
[17:23:47] technically .. the notification method is all configured with the contact and not the service
[17:24:15] i'll wait a bit
[17:51:20] Traffic, MediaWiki-ResourceLoader, Operations, Performance-Team: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (Krinkle)
[18:02:32] https://phabricator.wikimedia.org/P7537
[18:02:46] ^ "gdnsd replace" works on live stretch with a real config :)
[18:38:48] <_joe_> how can you do socket takeover and please lennart at the same time?
[18:38:56] <_joe_> and kay, obviously
[18:39:21] <_joe_> hhvm would greatly benefit from that :P
[18:45:55] :)
[18:46:59] well, the replacement daemon has to be a descendant of the original daemon to swap things under systemd at all
[18:47:23] but it can't be a direct descendant: you have to fork() twice or systemd will also get confused by another issue
[18:47:51] fork()ing twice necessitates setting up a communications pipe across the double-fork if you want to track and report the new PID or do further things on it
[18:48:38] and of course if the old daemon is going to spawn the new, and things are secure (daemon doesn't run as root, and uses prctl/caps bits to prevent re-escalation)
[18:49:43] then the daemon can't do privileged things on startup at all, so you have to move to a model where systemd does all the security/privdrop bits, and sets CAP_NET_BIND_SERVICE if you need it for privileged listening ports, and the daemon doesn't muck with privilege at all.
[18:50:17] and then once you've got all that sorted out, you also need an open Unix socket between the old and new, to hand off sockets via encoding them in SCM_RIGHTS messages
[18:50:40] that's pretty much the core of all the magick, in a nutshell
[18:52:51] in our case the open unix socket for handoff is just a connection via the same runtime control socket that "gdnsdctl" uses to send commands to the daemon anyways
[18:53:40] and for extra brain-bending, the daemon's controlsock listening socket, where it receives such control connects, is also passed to the new daemon via SCM_RIGHTS (over a connection to itself...)
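[editor's sketch] The SCM_RIGHTS handoff described above can be illustrated roughly as follows. This is not gdnsd's code: it is a minimal Python sketch that assumes a Unix platform and Python 3.9+ (for socket.send_fds / socket.recv_fds, which wrap SCM_RIGHTS ancillary messages). The function names and the b"listener" label are hypothetical, and both "daemons" live in one process here purely so the example is self-contained; in a real takeover the two ends of the Unix socket belong to the old and new processes.

```python
import socket


def hand_off_listener(control: socket.socket, listener: socket.socket) -> None:
    """Old-daemon side: pass the listener's fd as SCM_RIGHTS ancillary data."""
    # At least one byte of ordinary payload must accompany the ancillary data.
    socket.send_fds(control, [b"listener"], [listener.fileno()])


def receive_listener(control: socket.socket) -> socket.socket:
    """New-daemon side: receive the fd and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(control, 1024, 1)
    # The kernel installed a fresh descriptor referring to the same open socket.
    return socket.socket(fileno=fds[0])


def demo() -> bool:
    # Stand-in for the Unix socket connecting the old and new daemon.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # The "old daemon" owns a listening TCP socket on an ephemeral port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    port = listener.getsockname()[1]

    hand_off_listener(old_end, listener)
    inherited = receive_listener(new_end)

    # The old daemon drops its copy; the socket stays open via the passed fd,
    # so there is no window in which the port stops listening.
    listener.close()

    client = socket.create_connection(("127.0.0.1", port))
    conn, _peer = inherited.accept()
    client.sendall(b"ping")
    ok = conn.recv(4) == b"ping"

    for s in (client, conn, inherited, old_end, new_end):
        s.close()
    return ok


if __name__ == "__main__":
    print("handoff worked:", demo())
```

The point of the sketch is only that the listening socket is never closed and re-bound: the receiver gets a descriptor for the same kernel socket, so pending connections in the accept queue survive the swap.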
[18:54:37] (so that there's never a window where nothing answers gdnsdctl requests on the control socket, although it will deny certain operations and/or delay connecting briefly, at critical points in the handoff)
[18:55:29] ("certain operations" mean requests to stop or replace the daemon while it's already busy with a replacement handoff process)
[19:19:09] well I guess the other thing to mention, when contemplating doing the above to other daemons:
[19:19:41] if the daemon has multiple threads, open filehandles, mutexes, etc, etc... all the things you'd expect of complex software...
[19:20:48] it's virtually impossible to just asynchronously decide to do a clean fork()->fork()->exec() of a replacement while everything else is still running. All of the above focuses on just the why and how of the fork/fork/exec+socket-handoff part
[19:21:15] but what made it realistic for gdnsd is that the design is very simple, so I was able to impose a lot of other constraints on the code to make it possible.
[19:22:18] e.g. if some $random_runtime_thread happened to have a temporary filehandle open, when you fork->fork->exec the child you'll leak a copy of that filehandle into the child
[19:22:38] there are numerous pitfalls like that, which is why the best general-purpose advice is "don't fork a threaded program"
[19:23:33] for the filehandle/socket leak part of the problem, there is a "standard" way to fix it cleanly, which is to open all filehandles with O_CLOEXEC and all sockets with SOCK_CLOEXEC, so that they're atomically set to auto-close themselves for cleanup on exec()
[19:24:24] but in some cases, the relevant syscalls just-recently got SOCK_CLOEXEC or O_CLOEXEC support on various platforms, and may still not have it on older libcs/kernels. POSIX standards have yet to fully catch up even specifying it all.
[19:24:32] (which is sad, given how old the problem is)
[19:25:08] FreeBSD has a nice writeup on the CLOEXEC problems: https://wiki.freebsd.org/AtomicCloseOnExec
[19:25:40] TL;DR being that modern FreeBSD+Linux support it in all the interfaces that need it, but POSIX hasn't released matching standards yet, and older/other platforms YMMV
[19:27:35] so the point is, it would probably be very difficult to patch the rest of hhvm's threaded runtime to be "safe to randomly fork without leaking", and IIRC there are similar issues around deadlocking mutexes too, but I don't recall clearly.
[19:28:12] gdnsd only uses pthread mutex/condwait stuff during its startup sequence, not at runtime.
[19:29:05] or, you know, back when developers like me pleaded with systemd to sanely support existing models of smooth takeover, which didn't involve a descendant of the older process, they could've listened :P
[19:46:08] bblack: \o/ for the paste and very nice description, now you'll hate me, but you know you should write a blog post on this ;)
[22:24:46] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi) p:Triage>Normal
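[editor's sketch] The O_CLOEXEC / SOCK_CLOEXEC point from the 19:23 to 19:25 messages can be illustrated roughly as below. This is a minimal Python sketch assuming Linux (os.O_CLOEXEC and socket.SOCK_CLOEXEC are not available everywhere); the helper names are hypothetical. Note that Python itself has created descriptors as non-inheritable by default since 3.4 (PEP 446), so the explicit flags here only mirror what a C daemon would pass to open() and socket().

```python
import fcntl
import os
import socket


def is_cloexec(fd: int) -> bool:
    """Report whether the close-on-exec flag is set on a descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def open_cloexec(path: str) -> int:
    """Atomic variant: the flag is set by the same syscall that creates the fd."""
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)


def socket_cloexec() -> socket.socket:
    """Atomic variant for sockets, via SOCK_CLOEXEC in the type argument."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM | socket.SOCK_CLOEXEC)


def set_cloexec_late(fd: int) -> None:
    """Non-atomic variant: between creating the fd and this call, another
    thread may fork()+exec() and leak the descriptor into the child."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    fd = open_cloexec(os.devnull)
    sock = socket_cloexec()
    print("file fd close-on-exec:", is_cloexec(fd))
    print("socket close-on-exec:", is_cloexec(sock.fileno()))

    # Clearing the flag shows what a "leaky" descriptor looks like.
    os.set_inheritable(fd, True)
    print("after set_inheritable(True):", is_cloexec(fd))

    os.close(fd)
    sock.close()
```

The design point is the atomicity: set_cloexec_late() ends in the same state, but only the flag-at-creation forms close the race with a concurrent fork in a threaded program.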