[08:58:08] heads up, we're about to start the live-test of the switchdc
[08:58:35] so puppet will be disabled on eqiad/codfw CP hosts, then re-enabled selectively running puppet
[08:58:45] ok
[08:59:49] vgutierrez, Krenair, bblack: https://engineering.autotrader.co.uk/2018/09/04/letsencrypt-at-scale.html this is for you! :)
[09:19:13] ema: can you confirm those are the right hosts to disable puppet for traffic?
[09:19:25] cp[2001,2004,2007,2010,2013,2016,2019,2023],cp[1075,1077,1079,1081,1083,1085,1087,1089]
[09:20:40] volans: text only?
[09:21:11] you tell me :D the wiki says so
[09:24:20] volans: what are you about to do exactly?
[09:24:31] switchdc
[09:24:38] live test
[09:24:53] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_0_-_preparation point 1
[09:26:24] so basically we'll switch from codfw to eqiad
[09:26:38] for traffic it's completely "fake" as there are no puppet patches to be merged
[09:26:56] so basically we'll disable puppet and then in the next phases re-enable it selectively running puppet
[09:27:02] ok yes, you're testing the part that only affects text
[09:27:15] so yeah, those hosts look good to me
[09:27:25] AFAIK there is nothing to do for upload in the switchdc wiki
[09:27:32] in the mediawiki part
[09:27:39] that's handled by the traffic section
[09:27:46] right, not in the mediawiki part
[09:27:51] and swift
[09:28:16] ack, thanks for the heck
[09:28:17] *check
[09:28:21] lol
[09:28:24] <_joe_> we should really set originals active-active as well IMHO, but not *today*
[09:28:34] yeah please not today
[11:16:52] are we re-using commits from last year or making new ones?
[11:18:13] new ones, alex prepared them
[11:19:14] ok, I was just reviewing steps off wikitech then realized all the commit links are from 2017
[11:20:37] they should be updated now
[11:20:39] AFAIK
[11:21:39] they're still 2017 scripts in: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_1_-_stop_maintenance_and_merge_traffic_changes
[11:21:50] is there a topic gathering the new commits?
[11:22:03] akosiaris: ^^^
[11:22:06] err s/scripts/gerrit changes/
[11:22:16] they are updated in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Days_in_advance_preparation though :D
[11:22:33] ah, missed those
[11:22:34] * akosiaris fixing
[11:23:56] also, in those should we be moving appservers_debug, or not?
[11:24:36] IIRC it was discussed and decided not, let me check
[11:26:14] cannot find the history I was looking for, cc ema
[11:26:29] ok wikitech page updated
[11:26:57] no, the mwdebug servers were not in scope
[11:27:14] ok
[11:27:22] checking out some of the other trafficy bits
[11:27:46] yes, please do. I've uploaded most of them on Friday afternoon, there could be mistakes
[11:32:58] akosiaris: yeah there's some mixed up gerrit changes in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Traffic_-_Services
[11:33:39] I think you mean for the services a/a and then codfw commits there to be 458803 and 458804 (instead of 02 and 03 on the wiki)
[11:34:02] and then 02 and 05 seem to be the same for just restbase, which is already included in 03 and 04
[11:35:27] there's also some other already-A/A services we could/should swing into those (whereas including more A/Ps is probably out of scope at this point)
[11:35:56] webserver_misc_static, performance, planet, puppetboard, releases, wdqs are all A/A as well on text's list
[11:37:20] 02 is for making restbase active/active, then 03 is for switching the in scope services to codfw, then 04 is for undoing 03 and getting back to a/a and 05 for getting back to where we were before 02.
[11:37:57] wdqs is already in the changed services
[11:38:08] oh right
[11:38:21] ok let me stare at the 02-05 thing again, maybe I can't think straight enough yet
[11:38:22] I've skipped performance, planet and webserver_misc_static on purpose
[11:38:53] I did not even ponder about puppetboard and releases but I doubt they are in scope, although we can argue they are
[11:39:01] especially if it doesn't cost anything
[11:39:43] akosiaris: ok yeah you're right, the 02-05 commits are correct, it was just confusing me
[11:40:10] I can change subjects and put sequence numbers there if that helps
[11:40:33] no that's fine, it's just bikeshedding over naming etc
[11:40:58] so, re: the other A/As (webserver_misc_static, performance, planet, puppetboard, releases)
[11:41:50] they're already live A/A, meaning they're already fully redundant and routing traffic into both sides. So there should be no issue with shutting off the eqiad sides of them, and it's more realistic or whatever, to move whatever we easily can.
[11:42:09] why would we choose to skip any of them?
[11:43:18] when we decided to have the goal one of the premises was to keep it smaller and more contained than the previous one. So it's just about keeping the scope small, I have no technical argument. If you think it's fine, I am fine as well
[11:45:14] eh I guess leave it alone
[11:45:17] ok
[13:05:23] so I managed to package bblack's gdnsd commit adding acme-challenge stuff
[13:05:47] and get a cut down version of our authdns puppet class working using it in labs
[13:06:37] I used paravoid's github repo with most of the packaging stuff already there, just had to add a few bits for gdnsdctl and stuff
[13:07:34] ran `sudo gdnsdctl acme-dns-01 wmftest.org aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa` and then `dig _acme-challenge.wmftest.org TXT @127.0.0.1` said _acme-challenge.wmftest.org. 600 IN TXT "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[13:07:37] so that works
[13:08:20] next I'll try to hook it up with certcentral over SSH
[13:13:24] nice
[13:13:44] updating from p.void's packaging stuff for 3.x is on my list in general, do you have a link to whatever you already did there?
[13:14:33] haven't uploaded it anywhere yet though I could
[13:14:54] I'm still trying to finish writing a 2.x -> 3.x whatsnew/transition sort of document. Maybe 75% done now though. I've found/fixed a few more minor things along the way. Documenting things does help you think! :)
[13:17:39] oh wait, I did upload them when I moved to working on deployment-certcentral-testdns.deployment-prep.eqiad.wmflabs at /home/krenair/gdnsd - interestingly the tests behaved differently there (stretch) than my local machine (ubuntu 18.04 bionic)
[13:18:04] but, only real changes I made that are likely to be useful:
[13:18:47] https://phabricator.wikimedia.org/P7527
[13:20:33] I essentially checkout'd your acme-dns-01 commit, then copied in faidon's debian directory, replaced debian/gbp.conf, and then made these changes
[13:22:17] these might not be entirely correct but they got stuff working to the extent that I can do the certcentral integration bit
[13:27:22] nice, thanks!
[13:27:33] yeah there's other changes to be done for sure, but probably don't matter for this testing
[13:29:51] as for the puppet stuff, I pretty much commented all the plugin/service_types bits and pieces and removed most zones, also commented http_listen and zones_rfc1035_auto
[13:30:18] could probably have sorted the plugins etc. out but didn't need them for this
[15:55:28] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema)
[15:56:22] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema) Open>Resolved Request routing to all current applications added. Closing!
[16:22:02] krenair@deployment-certcentral03:~$ sudo -u certcentral ssh -i /etc/certcentral/dns-challenges/ssh-key.private certcentral@deployment-certcentral-testdns
[16:22:51] Creating directory '/nonexistent'.
[16:22:51] Linux deployment-certcentral-testdns 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05) x86_64
[16:23:29] krenair@deployment-certcentral03:~$ sudo -u certcentral ssh -i /etc/certcentral/dns-challenges/ssh-key.private deployment-certcentral-testdns sudo gdnsdctl acme-dns-01 wmftest.org aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
[16:23:30] krenair@deployment-certcentral03:~$ dig _acme-challenge.wmftest.org TXT @deployment-certcentral-testdns +short
[16:23:30] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[16:27:07] looks sane :)
[16:27:23] I do wonder what restrictions we need on the target certcentral user
[16:27:35] yeah
[16:28:03] so far it has a bash shell and a sudo rule to gdnsdctl acme-dns-01 *
[16:28:10] it may end up being some kind of restricted sudo-based on the other end of the ssh too.
[16:28:45] because gdnsdctl needs all the same rights for acme-dns-01 as it does for other operations like "stop" :)
[16:29:00] uh right well I just did this
[16:29:07] + sudo::user { 'certcentral':
[16:29:07] +   privileges => [
[16:29:07] +     'ALL = (root) NOPASSWD: /usr/bin/gdnsdctl acme-dns-01 *',
[16:29:07] +   ],
[16:29:07] + }
[16:37:11] yeah
[16:37:24] arguably you could set that to the "gdnsd" users the daemon is running as, instead of root
[16:38:26] krenair@deployment-certcentral-testdns:~$ sudo ls -lh /var/run/gdnsd/control.sock
[16:38:26] srw------- 1 root root 0 Sep 10 16:32 /var/run/gdnsd/control.sock
[16:39:17] should that not be root?
[16:40:15] gdnsd owns gdnsd.pid but not control.lock or control.sock
[16:52:52] is the daemon running as root?
[16:53:08] gdnsd.pid is from past versions, the new one doesn't write that anymore
[16:53:21] anyways, it's all not that important for your immediate testing
[17:10:13] yeah it is
[20:07:23] Traffic, Operations, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Dzahn)
[20:45:41] netops, Operations: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (ayounsi) I looked at it some time ago, the spike of DDOS_PROTOCOL_VIOLATION matches spikes of broadcast/multicast traffic we observed on asw2-a {F25757789} Spike of syslog messages from pro...
[23:29:08] gdnsd fails to load zones if you try to give it a zone with a single nameserver but it'll let you provide the same nameserver twice
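
Note on the 16:22-16:29 exchange: the manual test (push a TXT value with `gdnsdctl acme-dns-01` over SSH, then confirm it with `dig ... +short`) could be scripted roughly as below. This is only a sketch under assumptions. The function names are hypothetical and are not certcentral's code, the `sudo -u certcentral` wrapper from the log is left out, and the TXT value derivation (unpadded base64url of the SHA-256 of the key authorization) comes from the ACME dns-01 definition in RFC 8555, not from anything said in the log.

```python
import base64
import hashlib


def dns01_txt_value(key_authorization: str) -> str:
    """Return the dns-01 TXT value: unpadded base64url(SHA-256(key authorization))."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def challenge_command(key_path: str, remote: str, zone: str, txt_value: str) -> list:
    """Build the ssh invocation seen in the log (hypothetical helper).

    remote is something like 'certcentral@deployment-certcentral-testdns'.
    """
    return ["ssh", "-i", key_path, remote,
            "sudo", "gdnsdctl", "acme-dns-01", zone, txt_value]


def verify_command(zone: str, server: str) -> list:
    """Build the dig query used in the log to read the challenge record back."""
    return ["dig", f"_acme-challenge.{zone}", "TXT", f"@{server}", "+short"]


def txt_matches(dig_short_output: str, expected: str) -> bool:
    """Check whether `dig +short` output contains the expected TXT value.

    dig prints each TXT answer on its own line, wrapped in double quotes.
    """
    answers = {line.strip().strip('"')
               for line in dig_short_output.splitlines() if line.strip()}
    return expected in answers
```

Both command lists can be handed to `subprocess.run(..., check=True, capture_output=True, text=True)`. Because the value is base64url, it contains no shell metacharacters, which matters since ssh passes the remote command through the target user's shell.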