[04:13:53] We are going to failover s4 master in 45 minutes
[09:10:28] access request question: does the deployment group overlap with the restricted group?
[09:11:23] XioNoX: no
[09:11:28] ok!
[09:11:46] XioNoX: i think i already know which ticket you are on...
[09:11:53] Cindy
[09:12:18] ah, no that's different
[09:12:42] hold on, i gave you the wrong answer then
[09:13:28] If you are still relying on ldaplist and not using ldapsearch,
[09:13:28] please comment on https://phabricator.wikimedia.org/T114063
[09:13:28] before 30 August 2016. If nobody comments, ldaplist will be removed!
[09:13:41] well
[09:14:02] https://github.com/wikimedia/puppet/blob/production/modules/admin/README#L70 probably needs updating then
[09:14:04] that was written by Yuvi, who is not here anymore
[09:15:37] XioNoX: fixing my comment. Yes, restricted is a full subset of deployment, having both is not needed
[09:15:44] it's only deployment then
[09:15:48] thx!
[09:17:00] i think https://phabricator.wikimedia.org/T114063 is stalled as in "nobody is working on this"
[09:28:07] mutante: not sure I understand what I should do next on that CR
[09:32:09] XioNoX: add Tyler or greg-g and get them to review
[09:32:59] done, thx!
[13:18:13] If I get `Unable to run wmf-auto-reimage-host: Unable to find certificate fingerprint in` when running wmf-auto-reimage-host, did I do something wrong or is that a known/expected failure?
[13:18:46] hnowlan: that seems unusual, like a typo in the host name or something
[13:18:57] which host is it?
[13:19:21] restbase2009
[13:19:29] it was followed by `no certificate found and waitforcert is disabled` though, which is clearly a puppet error
[13:19:43] hnowlan: I can have a look in 5
[13:19:44] and the host definitely did reprovision
[13:20:50] what i can confirm so far is there is a puppet cert for restbase2009.codfw.wmnet on the puppetmaster
[13:21:40] yeah, it looks like things worked as expected
[13:21:47] just a weird failure to see
[13:23:41] I actually see the exact same error in the logs for the reimage you did for restbase2014 a few weeks back, mutante
[13:24:25] heh, ok :)
[13:24:47] let's clarify
[13:24:51] yea, the log looks like things worked despite it
[13:25:09] 1) it failed, so no, it didn't complete successfully, as the reimage doesn't stop at the first puppet run but does additional things after that
[13:26:20] the "_cumin.out" file actually shows the fingerprint of the cert
[13:26:26] that it talks about not finding in the .log
[13:26:48] 2) the 'Exiting; no certificate found and waitforcert is disabled' is returned by the puppetmaster, not the script
[13:26:55] IIRC, double checking now
[13:27:31] sorry, wrote it wrong
[13:27:37] ----- OUTPUT of 'puppet cert list...et' 2> /dev/null' .... "restbase2009.codfw.wmnet" (SHA256) 92:BD:E4:65:81:1E:B0:A0:35:90:19:DB:19:B4:D3:BB:5E:94:DC:D8:6D:2E:1D:DC:F7:6E:64:29:A5:65:4F:3E
[13:27:38] D3: test - ignore - https://phabricator.wikimedia.org/D3
[13:27:38] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8
[13:27:40] I would expect to see that error when running puppet agent with an unrecognised cert
[13:27:41] Removing file Puppet::SSL::Certificate restbase2009.codfw.wmnet at '/var/lib/puppet/server/ssl/ca/signed/restbase2009.codfw.wmnet.pem
[13:28:00] stashbot: krkr :)
[13:28:01] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[13:28:25] nice job stashbot :-P
[13:28:40] it removed the existing cert.. and then a little bit later it says it can't find it
[13:28:42] mutante: I ran `puppet cert -s restbase2009.codfw.wmnet` fwiw
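For readers following along, a minimal sketch of the manual cert flow being discussed here, assuming the Puppet 5 `puppet cert` subcommands that appear elsewhere in this log (normally wmf-auto-reimage performs these steps itself):

```
# On the freshly installed host: the first agent run generates a key pair,
# sends a CSR to the puppetmaster, and prints the cert fingerprint.
sudo puppet agent --test --color=false

# On the puppetmaster: list waiting requests, eyeball the fingerprint against
# the one printed on the host, then sign it (hnowlan's `puppet cert -s` above
# is the same sign step).
sudo puppet cert list
sudo puppet cert sign restbase2009.codfw.wmnet
```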
[13:29:10] mutante: ofc, it removes the cert at the start, that's expected
[13:29:49] sure, but not that it is surprised about not finding one afterwards?
[13:30:08] that's the client
[13:33:43] hnowlan: the usual workflow is this
[13:33:58] after d-i the host gets rebooted into the new OS
[13:34:27] if we detect systemd as init system, we run 'puppet agent --test --color=false' so that the puppet client generates a CSR and sends it to the master
[13:34:39] we get the output of this run and extract the fingerprint of the cert
[13:35:08] and then we go to the puppetmaster and sign it
[13:35:23] verifying that the fingerprint matches first
[13:35:32] volans: d-i?
[13:35:36] debian-installer
[13:35:49] oh right
[13:36:02] in this case the output of the first puppet run didn't show the fingerprint, as if it had already been run; that was the case before systemd, with the old init and the old puppet package
[13:36:23] so either something has changed here or for some reason the puppet client was already run once
[13:37:18] and to make it totally clear: the reimage failed, the host is not installed
[13:37:33] has a new OS: yes, has puppet running: no
[13:38:06] unless something manual was done after that
[13:38:13] I have done manual stuff after that
[13:38:38] it's better to understand the failure and let the automation do all the steps
[13:38:49] running `puppet agent --test` and then signing on the puppetmaster, which I would expect to do for a host where the cert has changed between installs.
[13:39:00] So was the source of the failure the fact that there was a preexisting certificate on the master?
[13:39:21] no, that cert has been revoked on the master at the start
[13:39:26] it's all taken care of by the reimage script
[13:39:38] that performs many additional steps
[13:39:41] before and after
[13:39:46] the reimage
[13:40:18] May 29 13:09:48 puppetmaster1001 puppet-master[25310]: restbase2009.codfw.wmnet has a waiting certificate request
[13:40:53] that's before the reimage detected the host was up:
[13:40:57] 2020-05-29 13:10:25 [INFO] (hnowlan) wmf-auto-reimage::print_line: Uptime checked
[13:41:14] so that means that the puppet agent was already run once at that point
[13:41:17] which OS are you installing?
[13:42:21] stretch I see
[13:42:43] May 29 13:09:45 restbase2009 systemd[1]: Started Puppet agent.
[13:42:55] Yep, just wanted to fully reimage it as it was before
[13:45:32] jbond42: by any chance do you know if anything has changed recently in the deb package for puppet for stretch?
[13:45:37] 5.5.10-2~deb9u2
[13:46:02] it seems that systemd started puppet-agent after the reboot after the debian-installer
[13:46:24] that logic has already changed once in the past, hence asking
[13:46:39] I have a hunch, let me check
[13:47:40] hnowlan: most reimages nowadays are done with buster, hence you might have hit some corner case
[13:48:34] indeed, this is a side effect of one of the changes which moved Puppet 5 / Facter 3 into "main"; making a patch to fix this
[13:48:35] One thing I'm curious about is the empty message in the error `Unable to find certificate fingerprint in:`. Looks like there was no output from `puppet_generate_certs`
[13:49:06] hnowlan: that's: 'Unable to find certificate fingerprint in:\n{msg}'
[13:49:25] so "Exiting; no certificate found and waitforcert is disabled" is the message
[13:49:37] ohh, misunderstood that newline
[13:49:43] the leading one
[13:49:46] because the puppet agent shows the newly generated cert and its fingerprint only at the first run
[13:49:58] at subsequent runs it shows just that message
[13:50:15] that is misleading, I know :)
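A hedged illustration of what volans just described: this is not the wmf-auto-reimage source, just the quoted template replayed in shell to show why the text after the colon starts on its own line and the error can look empty at first glance.

```
msg='Exiting; no certificate found and waitforcert is disabled'
# The script's template is 'Unable to find certificate fingerprint in:\n{msg}':
printf 'Unable to find certificate fingerprint in:\n%s\n' "$msg"
# Output:
#   Unable to find certificate fingerprint in:
#   Exiting; no certificate found and waitforcert is disabled
```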
[13:50:27] moritzm: you rock, thanks!
[13:52:01] https://gerrit.wikimedia.org/r/599867 should fix it
[13:53:54] great! thanks a lot.
[13:54:02] thanks moritzm!
[13:54:12] hnowlan: would you be ok to retry the reimage once merged, if it's not too much trouble?
[13:54:21] absolutely, no bother
[13:54:44] sorry for the trouble. I'm also curious, why re-imaging into the same OS? :D
[13:56:23] Just a dead disk replacement; cassandra is running in JBOD so it's simpler to reimage as is
[13:56:35] got it :)
[13:56:53] hnowlan: merged the patch and ran puppet on the preseed servers
[13:57:25] there goes my secret plan to silently break stretch installs and force people into moving to buster :-)
[13:57:37] haha, sorry :D
[13:58:06] We're talking about upgrading, I promise
[13:58:13] lol
[14:00:37] future chaos engineering ideas: have a scheduled task that reimages one random server in the fleet every 6 hours during monday-thurs working hours. let service owners blacklist critical servers that can't be part of that scheme.
[14:00:47] use the blacklist to guide future efforts to reduce it :)
[14:01:12] bblack: +1
[14:01:29] we've discussed/proposed this multiple times in the past few years :)
[14:01:34] we can start with bastions, they need to be upgraded to buster and worst case some people need to use another one, heh
[14:02:07] bblack: clearly the blocker for this is the service owners catalog :-P
[14:02:40] well, creating the mechanism and announcing a start date a few months out might get people thinking about the ones they care about
[14:02:43] well kormat just mostly-fixed one other blocker
[14:02:48] (keeping /srv around across reimages)
[14:03:13] that's cheating a little, but I guess it reduces the blacklist considerably for the first go-round
[14:03:51] I don't think it's cheating
[14:04:20] it's not cheating if it's an optimization and the contents of /srv would be reproduced on reimage anyways, just slower
[14:04:31] yeah, I agree
[14:05:10] <_joe_> it is cheating if what you want to test is that your host is expendable
[14:05:14] <_joe_> databases clearly aren't
[14:05:17] right
[14:05:24] <_joe_> not completely expendable at least
[14:05:27] but they can be, hypothetically, in some future architecture
[14:05:49] <_joe_> bblack: let me rephrase: "not expendable without cost/time/more resources"
[14:06:08] <_joe_> so in general you don't want to force yourself to reimport all the db data from another host, if possible
[14:06:33] or re-shuffle all shards for ES
[14:26:56] moritzm: patch worked, thank you!
[14:44:57] great :-)
[14:50:54] elukey: XioNoX: have a minute for a last glance at https://gerrit.wikimedia.org/r/c/operations/puppet/+/598841 ?
[14:51:13] we do seem to have fixed the issue -- https://phabricator.wikimedia.org/T253128#6177207
[14:52:14] cdanis: lgtm
[14:52:33] ty, will disable puppet on netflow* and roll out carefully
[14:58:27] lgtm!
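As a rough sketch of what "disable puppet on netflow* and roll out carefully" can look like with cumin; the host expression and the canary hostname are illustrative assumptions, not a record of what cdanis actually ran.

```
# Freeze puppet on the affected hosts before merging the change.
sudo cumin 'netflow*' 'puppet agent --disable "rolling out 598841"'
# After merging, re-enable and run on a single canary first, then the rest.
sudo cumin 'netflow1001.eqiad.wmnet' 'puppet agent --enable && puppet agent --test'
# Note: puppet agent --test exits 2 when it applies changes, so cumin may
# report that run as failed even though it succeeded.
```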
[15:09:18] https://www.youtube.com/watch?v=XkfpopJFAdk
[15:09:44] "how to repair network at scale" from google is the first talk
[15:12:57] ty!
[15:36:48] XioNoX: the network is never broken
[15:37:00] :D
[15:37:12] haha
[15:37:15] I wish
[15:37:55] but then they could not talk about their tooling to open vendor tickets automatically, all the way to the RMA
[15:45:24] ahaha
[15:46:24] "psychology of ipv6", sounds interesting
[15:49:32] I had doubts, but it's fun
[15:51:35] "excuse factory" is good
[18:06:56] apergos, marostegui: be advised in https://gerrit.wikimedia.org/r/593797 I've moved your cheese, if you need to re-enable that job you can find it under modules/profile/manifests/mediawiki/maintenance/ now
[18:16:10] uh thanks, though I don't know how I made the notify list :-)
[18:20:42] apergos: via https://gerrit.wikimedia.org/r/596172 :) wasn't sure if you'd care about it long-term, but I figured too loud is better than too quiet
[18:23:40] lol
[18:24:06] nah it's fine, and the next time people need to tinker with the maintenance jobs I'm sure they'll find the new location
[18:24:15] 👍
[18:24:30] (p.s. betting odds it won't be me)
[18:24:37] haha acknowledged
[20:54:41] A friday afternoon puzzle: I have two hosts that I recently rebuilt with Buster. I have some amorphous/invisible connectivity issues and suspect that one of the services involved is choking on ipv6 origination IPs. So… I'm trying to just disable v6 on both servers as an experiment.
[20:54:42] sysctl -w net.ipv6.conf.all.disable_ipv6=1
[20:54:48] On one of them, that does what I'd expect
[20:54:57] the other falls off the network entirely and can't be reached until a reboot
[20:55:04] So… what is different between those two?
[21:36:41] oops, I'm wrong, they both fail if I run that command
[21:42:27] do they actually fall off the network entirely, or is your ssh connection using ipv6 from the bastion?
[22:01:23] cdanis: probably the latter, although I can't subsequently ssh to them
[22:01:44] so now I'm on to a new issue, which is that it appears that ipv6 traffic is just filtered between my two hosts
[22:01:53] that would certainly account for connectivity issues
[22:02:45] ex:
[22:02:54] https://www.irccloud.com/pastebin/gVorHwpb/
[22:07:56] andrewbogott: hmm.. that could be fallout from the ipv6 stuff that a.rturo was doing last quarter maybe? Is that IPv6 network non-standard?
[22:09:28] connection refused might indicate that the service is configured to listen on ipv4 but not ipv6, netstat -l should help
[22:09:35] * cdanis afk
[22:09:42] cdanis: yes, that's my current theory although I'm not sure it's complete
[22:12:09] it's not that, I don't think
[22:12:20] I can telnet to that IPv6 address from my laptop
[22:13:44] andrewbogott, maybe traceroute?
[22:14:37] I think I have the immediate v4/v6 thing sorted
[22:14:48] not sure if that's going to help with my 'actual' issue but going to make a patch before I forget
[22:16:19] extremely simple change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/600017/1/modules/openstack/templates/rocky/designate/designate.conf.erb
[22:52:30] andrewbogott: yeah, that should bind to all ipv6+ipv4 addrs
[22:53:10] yeah, too bad it didn't actually fix the bug :(
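To close out, a hedged sketch of the listener checks suggested above (cdanis' netstat -l idea); the port number is a placeholder rather than anything read off andrewbogott's hosts.

```
# Check whether the daemon listens on IPv4 only or also on IPv6.
sudo netstat -ltn | grep 9001          # 9001 is a placeholder port
sudo ss -ltnp 'sport = :9001'          # same check with the newer tool
# Binding to '::' (as in the designate.conf change linked above) normally
# accepts both IPv6 and IPv4 connections, as long as net.ipv6.bindv6only
# is 0, which is the Linux default:
sysctl net.ipv6.bindv6only
```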