[03:39:53] Traffic, Operations, MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Legoktm) After those ULS patches, the current status is that MW is setting Vary: Accept-Language un... [04:41:46] vgutierrez: Krenair: err, now that the packaging stuff was removed, what did you want me to review? [04:47:19] anyways, gave it a look over [07:33:08] Traffic, Operations, MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Nikerabbit) > The language selectors are generating URLs with ?uselang=XX Why are you not using (j... [10:30:11] vgutierrez, legoktm: So I removed the skipsdist = True from tox.ini [10:30:18] now setup.py can't find README.md [12:51:36] !log trafficserver 7.1.3+ds-4wm3 uploaded to stretch-wikimedia T199720 [12:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:41] T199720: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 [14:30:29] ema: you around? [14:33:09] volans: yup [14:33:22] available for the hacky test on pink unicorn? [14:33:38] sure [14:33:42] if there aren't other things going on on that host [14:34:08] nope, we're good to test there [14:34:20] great, so let me disable puppet via spicerack there [14:34:26] like the switchdc will do [14:34:35] sounds good [14:34:59] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/457850/ rebased [14:35:32] remote.query('A:cp-canary').run_sync('disable-puppet "test switchdc - volans"') [14:35:35] done [14:36:01] volans: ok to merge? 
[14:36:05] go for it [14:37:11] volans: puppet-merged [14:37:57] ema: ack, preparing the command to enable+run puppet and get the output [14:39:37] ema: proceeding [14:39:41] k [14:39:59] remote_output = remote.query('A:cp-canary').run_sync(ENABLE_COMMAND) [14:40:20] and then I'll use _check_changes defined in the 04-switch-traffic cookbook directly [14:40:39] alright [14:41:09] there are a bunch of errors in the puppet run [14:41:10] expected? [14:41:16] mmh, no [14:41:34] Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[text-backend]/Exec[load-new-vcl-file]: Failed to call refresh: /usr/share/varnish/reload-vcl -f /etc/varnish/wikimedia_text-backend.vcl -d 2 -a -s /etc/varnish/wikimedia_misc-backend.vcl || (touch /var/tmp/reload-vcl-failed; false) returned 1 instead of one of [0] [14:41:39] Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[text-backend]/Exec[load-new-vcl-file]: /usr/share/varnish/reload-vcl -f /etc/varnish/wikimedia_text-backend.vcl -d 2 -a -s /etc/varnish/wikimedia_misc-backend.vcl || (touch /var/tmp/reload-vcl-failed; false) returned 1 instead of one of [0] [14:41:48] and then at the end [14:41:48] Error: /usr/share/varnish/reload-vcl -f /etc/varnish/wikimedia_text-backend.vcl -d 2 -a -s /etc/varnish/wikimedia_misc-backend.vcl && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0] [14:41:52] Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[text-backend]/Exec[retry-load-new-vcl-file]/returns: change from notrun to 0 failed: /usr/share/varnish/reload-vcl -f /etc/varnish/wikimedia_text-backend.vcl -d 2 -a -s /etc/varnish/wikimedia_misc-backend.vcl && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0] [14:43:19] volans: ah, it's the codfw caches not being defined on cp1008 [14:43:41] if (req.http.X-Wikimedia-Debug) { set req.backend_hint = appservers_debug.backend(); } else { set req.backend_hint = cache_codfw.backend(); set req.http.X-Next-Is-Cache = 1; } [14:43:55] 
cache_codfw is undefined ^ [14:44:22] ack, hence the failure of finding the codfw output I guess [14:45:42] let me try to get rid of the debug directors on cp1008 as a test [14:45:50] also we're doing it in 1 step [14:45:54] instead of 2 [14:46:03] yeah [14:46:08] I mean in prod we do 1 commit but enabling it selectively in different hosts [14:46:14] here is all-in-one :) [14:46:20] but I can test the revert [14:46:24] and check that codfw matches I guess [14:46:26] sorry [14:46:28] eqiad [14:46:35] sure, please do [14:47:28] re-disabled puppet, creating CR [14:47:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/458515 [14:48:22] volans: +1 [14:49:03] merged [14:49:45] running puppet [14:51:53] ema: check succeeded [14:51:59] didn't ask for confirmation [14:52:22] nice [14:52:24] so in theory we should be ok :) [14:53:20] let me know if you want to do any additional tests [14:54:21] volans: at the moment I'm not really sure why the debug directors would try to use cache_codfw backends [14:54:43] volans: however, for testing purposes we can temporarily get rid of those https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458516/ [14:54:46] but that's unrelated to the switch right? 
[14:55:00] or you're worried it might be [14:55:12] no, I think it's unrelated [14:55:37] I was just saying in case you wanted to try again the eqiad -> codfw switch [15:03:15] I think we're good, the regex are basically identical, but if you want sure [15:03:19] same for me [15:14:25] Traffic, Operations: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (Vgutierrez) p:Triage>Normal [18:36:33] vgutierrez: I don't understand what https://phabricator.wikimedia.org/T203678 is even about :P [18:37:49] Traffic, Operations: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (Krenair) [18:38:29] paravoid, that'll be about making https://phabricator.wikimedia.org/diffusion/OSCC/browse/master/certcentral/certcentral.py$106 configurable [18:38:58] so s/echo/some_future_script.py/ isn't a code change -> repackaging -> redeploy [18:39:07] is this a change to certcentral or to authdns-*? [18:39:19] it's a change to certcentral [18:39:28] oh ok, until 2 minutes ago that change didn't mention certcentral at all :) [18:39:47] ah but was a subtask of the ACME stuff, ok [18:40:13] I was reading it through my inbox, and didn't understand which cmd it was referring to etc. [18:40:23] thanks :) [18:41:05] from certcentral's pov, it wants to execute: 'some_command.py example.org 20hf3qhueaihfiahfeiahuweifuh www.example.org 21rh0w1hOIEHFOAIEWF' to know that the appropriate DNS zones now answer challenges and it can let LE know to query them. [18:41:28] and gdnsd will have a matching "gdnsdctl acme-dns-01 example.org 20hf3qhueaihfiahfeiahuweifuh www.example.org 21rh0w1hOIEHFOAIEWF" [18:41:44] oh? gdnsd? [18:41:46] the script in question hooks the two up via a list of dns servers to push to and authorized_keys to go execute it on them all, etc [18:41:49] is that a new dynamic plugin or something? 
[18:42:17] not really a plugin, it's just baked into the core as a special kind of dynamic data [18:43:07] huh [18:43:16] gdnsdctl injects TXT records for the ACME DNS-01 case, and they auto-expire from the runtime after a configurable 10 minutes, and don't require any support from zonefiles (or conflict with concurrent updates to zonefiles while that timer is ticking, etc) [18:43:28] wow [18:43:45] that's new right? [18:43:56] it's not even in gdnsd's proper master branch yet [18:44:01] heh [18:44:23] I'm still finishing up testing and integration of a couple more commits beyond that, before moving it all back to master and calling it to the to-be-3.x beta release or whatever [18:44:33] so how is certcentral going to invoke gdnsdctl? [18:45:12] from certcentral's pov, there's a (configurable per the ticket above) script it executes locally, which makes sure DNS challenges are ready to go. [18:45:29] oh so certcentral will run on the authnses? [18:45:33] that script will be puppetized with config for a list of our authdns server hostnames and authorized_keys to reach them, and will ssh out to them all and execute gdnsdctl [18:45:40] ah [18:45:44] no, certcentral will be separate hosts, like ganeti instances [18:46:55] the script referenced in the ticket will be what SSHes out to the auth DNS machines to run the command [18:47:08] do you envision the transport for this becoming DDNS at some point? [18:47:16] I doubt it [18:47:36] because of the complexities in implementing DDNS/TSIG or..? [18:48:00] that and the lack of other use-cases. In general, gdnsd's model doesn't fit well with the concept of dynamic updates. [18:48:24] acme-dns-01 injection of temporary results is easy enough to glomp onto the side of things, but I wouldn't want to make it a general mechanism. 
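Editorial sketch of the flow described between 18:41 and 18:46: a local script takes name/token pairs from certcentral and prepares one "gdnsdctl acme-dns-01 ..." invocation per authdns server over SSH. Only the argument shape (alternating name and token) comes from the log. The function names, the host names, and the plain "ssh <host> <command>" form are assumptions made for illustration, and the sketch only builds the argument lists rather than executing anything.

```python
import shlex
from typing import Iterable, List, Sequence, Tuple


def build_challenge_args(challenges: Iterable[Tuple[str, str]]) -> List[str]:
    """Flatten (name, token) pairs into: name token name token ..."""
    args: List[str] = []
    for name, token in challenges:
        args.extend([name, token])
    return args


def build_ssh_commands(
    hosts: Sequence[str], challenges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Return one argv list per authdns host (hypothetical ssh invocation)."""
    # Remote command in the shape quoted in the log:
    #   gdnsdctl acme-dns-01 <name> <token> [<name> <token> ...]
    remote = ["gdnsdctl", "acme-dns-01", *build_challenge_args(challenges)]
    # Quote each argument so the remote shell sees it as a single word.
    remote_cmd = " ".join(shlex.quote(arg) for arg in remote)
    return [["ssh", host, remote_cmd] for host in hosts]


if __name__ == "__main__":
    # Host names and tokens below are placeholders, not real values.
    for argv in build_ssh_commands(
        ["authdns1.example.org", "authdns2.example.org"],
        [("example.org", "tokenA"), ("www.example.org", "tokenB")],
    ):
        print(argv)
```

Each returned list could be handed to something like subprocess.run by a real script. Error handling and key management (the authorized_keys side mentioned in the log) are left out of this sketch.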
[18:49:41] ok [18:50:32] if I did put in a ddns protocol implementation just for acme (which sounds like a lot more work and pain), then the next feature request would be "why not let clients send A-records too"? and so-on :P [18:50:55] it's a fair question :P [18:50:59] in labs I imagine I'll just point this at a script that updates Designate, which has an API [18:51:44] I see in the code also that acme-http-01 was implemented as well? [18:52:14] paravoid: because then we fall into the deeper rabbithole of maintaining state for them, and general-purpose concurrency control with the N DNS threads vs the updaters touching things all over the zone data tree, etc... [18:52:32] and I'm trying to avoid that complexity, which is likely not great for resilience/performance either [18:53:10] if anything I'd like to move more in the direction of pre-coded responses even (where output packets for most queries are already 99% assembled as they'd look on the wire, in the in-memory DB) [18:53:32] are we planning to do acme-http-01 for some stuff and acme-dns-01 for others? [18:53:54] I expect we'll end up using dns-01 pretty much everywhere in prod [18:54:07] the operational plan here is just to use dns-01 for it all. but certcentral also supports http-01 for testing and/or for others' use/integration. [18:54:07] currently everything touching ACME in prod is http-01 though [18:54:16] and/or just in case my gdnsd release never made it out the door [18:54:20] heh [18:54:54] I'd love to learn more about that discussion and pros/cons, is it documented anywhere? [18:55:09] which particular part? [18:55:14] http-01 vs dns-01? [18:55:34] that yes, but anything else you have I'll take it to :P [18:55:44] so, dns-01 is necessary to support wildcards. So we have to use it for at least that case. [18:55:55] s/to/too/ [18:56:48] so the arguments are really about whether to also use http-01 where possible. 
One counter is if we use it only for wildcards, and wildcards are a small subset of the certs we plan to issue, then dns-01 just doesn't get regular exercise and we might be surprised by breakage. [18:57:04] (e.g. some dns puppetization breaks the dns-01 mechanism and 3 weeks later we finally try to get a cert and wonder when it broke) [18:57:42] but the more compelling argument is it eliminates having to separately integrate http routing for all possible http servers using certcentral's certs [18:58:23] (e.g. nginx + varnish + ATS + apache + gerrit/java + ...)... all of them would need separate custom integration for their (hopefully multiple) public HTTP/S endpoints to route challenge requests back to certcentral [18:58:35] this is great and thanks for going through the effort, but I'm running out of time and need to go -- I'm sorry :( [18:58:40] or some other integration which actually pushes challenge data as a static file out to them all and loading it [18:58:52] vs the singular integration to gdnsd for dns-01 that works for all cases [18:58:53] I don't want to pester you too much, if there's anything in docs/tasks/etc. I can read it all later [18:59:15] rationales are not well-documented :) [19:00:07] the 3rd point is that with dual relatively stateless (at least, not aware of each other) certcentrals in 2x DCs, the http-01 challenge problem gets even harder if you take the routing method (which one do they route requests to for challenges from one or the other's reissuance?) [19:00:15] there's also a new one, TLS-ALPN-01, in the works [19:00:27] don't know if that's allowed with Let's Encrypt yet [19:01:09] I imagine it has the same drawbacks as HTTP-01 [19:01:23] netops, Operations, cloud-services-team: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (RobH) Ok, so we've been discussing this in IRC. 
when trying to use cloudvirt1023 in the labs-hosts1-b-eqiad vlan, if it has NO specific entry for the kerne... [19:10:22] legoktm, where in debian/rules would I run help2man? [19:10:42] hmm let's see what other packages do [19:13:20] so it looks like most packages create a separate make target for the manpages, but commit the results [19:13:22] e.g. https://sources.debian.org/src/python3.7/3.7.0-6/debian/rules/?hl=768#L768 [19:13:44] https://sources.debian.org/src/jbigkit/2.1-3.1/debian/rules/?hl=28#L28 [19:13:54] (via https://codesearch.debian.net/search?q=help2man+path%3Adebian%2Frules ) [19:27:35] netops, Operations, cloud-services-team, Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (RobH) Ok, I'm going to outline all the troubleshooting steps below that I've done to demonstrate that the issue is inherently one with... [19:40:05] netops, Operations, cloud-services-team, Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (RobH) **cloudvirt1024.eqiad.wmnet is in the labs-hosts1-b-eqiad vlan/subnet with the IP address of 10.64.20.43. loading stretch over h... [19:48:04] legoktm, I can't run my binary until the package is installed [19:48:15] well [19:48:17] entry point [19:48:27] so help2man won't work on a system just trying to build without having it already installed [19:50:16] to make that man file I had to build a package without it, install it, run help2man and save the result, remove the package, commit the file and rebuild the package to include the man file [19:51:25] Krenair: ahh, nvm then [19:51:43] not worth the effort [19:57:48] ! 
[remote rejected] upstream/0.1 -> upstream/0.1 (prohibited by Gerrit: createTag not permitted) [19:58:25] can't do it through the GUI either, 403 Cannot create annotated tag "refs/tags/upstream/0.1" [19:59:25] sigh [19:59:28] okay I know what this is [20:00:56] netops, Operations, cloud-services-team, Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (RobH) **cloudvirt1024b.eqiad.wmnet is in the private1-b-eqiad vlan/subnet with the IP address of 10.64.16.27. loading stretch over ht... [20:00:59] okay there we go [20:01:22] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/certcentral/+/f61892a46031d67d4b2e49b351c2418a25979603 [22:17:38] netops, Operations, ops-eqiad: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (ayounsi) p:Triage>High