[10:01:02] Hey oncallers, I'm going to depool codfw ms swift for a little to repair a bunch of thumbnail container DBs (cf https://phabricator.wikimedia.org/T383053#10953459 )
[10:30:32] That work is done; going to do the same to eqiad this afternoon (will note here first again)
[11:02:57] moritzm: I'm getting a diff removing ganeti2022 on the DNS cookbook, ok to deploy those?
[11:07:13] yeah, I'm currently running the decom cookbook for it, you can already merge it along
[11:07:34] thanks
[11:21:36] About to depool eqiad ms swift to do the thumbnail container DB repairs there
[12:11:24] repair work done, repooling ms in eqiad
[13:51:21] if there are no objections, I will (shortly) resume the sessionstore reimages, moving to eqiad now; no impact expected
[15:45:36] hi folks. does someone enjoy Partman recipes? (trick question I know). Traffic could use a review for the new cp servers in codfw that are UEFI-based and the resultant Partman recipe:
[15:45:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162840
[15:48:06] I'll have a look tomorrow
[15:48:26] oh thanks!
[15:57:13] moritzm: thx, feel free to ping me if needed (f.a.b.f.u.r is on PTO)
[15:58:16] makes sense, everyone needs a vacation after writing a partman recipe!
[16:10:28] (half partman recipe, pls) :)
[16:23:51] moritzm: or therapy
[16:24:01] a hug, at least
[18:32:20] cwhite, jhathaway: fyi the reimage of sessionstore1005 did not go well, when first rebooted it failed to come back up due to a hardware failure of some kind. In the event of another node going down in eqiad, sessionstore will be down (down for clients routed to eqiad).
[18:33:36] thanks urandom
[18:33:37] I'm trying to see if we can get someone in dcops to look at it, but in case that happens (other otherwise in the meantime) the solution would be to depool eqiad sessionstore
[18:33:52] s/other otherwise/or otherwise/
[18:34:23] the only other alternative would be to depool it now, but that will add latency to every transaction
[18:34:36] not sure where the balance of trade-offs are
[18:36:24] a hypothetical additional node failure is just that, hypothetical, so maybe let it ride and be quick on the depool if we're unlucky?
[18:37:28] if a depool is easy to perform, that sounds reasonable
[18:37:34] I mean I obviously can't comment on this for sessionstore but that's what we have done in case of some Traffic stuff, particularly the load balancers. there were other options available but the safest one was "just depool the site" if the backup hardware fails as well.
[18:37:38] though middle of the night might be rough
[18:37:54] ^ yeah. factor that in and also factor in that you may not be around (you -> urandom)
[18:45:02] so, I haven't had to do this since we got a cookbook for it, but I assume: `sudo cookbook.sre.k8s.discovery.service-route depool codfw sessionstore` would do it?
[18:48:19] I think you want to be depooling eqiad?
[18:48:26] oh, right
[18:48:30] sorry... copypasta
[18:49:51] there's also always `confctl --object-type discovery --reason 'Depool due to risk of further hardware failure' select 'dnsdisc=sessionstore,name=eqiad' set/pooled=false`
[18:51:43] yeah, that's how I've done it in the past
[18:56:19] but yeah, aside from switching codfw to eqiad and switching the cookbook name to `cookbook.sre.discovery.service-route` (no `k8s`), either one should work
[18:57:07] as a bonus, the cookbook wipes recursor caches, which makes things a bit faster
[18:58:15] under-rated :) ^
[19:05:37] where did I get k8s? that was copy-pasted from somewhere..
[19:05:44] anyway, yes, thanks
[19:06:09] if it's going to be down for day(s) I'll send out an email to give everyone a heads up
[19:06:20] ooooh
[19:06:43] the help synopsis for `sre.discovery.service-route`!
[19:07:05] it's examples
[19:07:20] should probably fix those
[19:07:59] lol, good find! I'd never noticed that before
[21:28:42] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1165161 fwiw
[21:36:43] +1 thanks for fixing that!