[07:01:13] Traffic, Maps, SRE: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10209484 (MoritzMuehlenhoff)
[07:44:25] Traffic: Provide debian packages for liberica - https://phabricator.wikimedia.org/T376600#10209533 (Vgutierrez)
[07:46:45] Traffic: Sync liberica etcd library requirements with versions provided on debian bookworm - https://phabricator.wikimedia.org/T376696 (Vgutierrez) NEW
[08:10:29] netops, Ceph, Infrastructure-Foundations: cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697 (ayounsi) NEW
[14:03:38] Traffic: Sync liberica etcd library requirements with versions provided on debian bookworm - https://phabricator.wikimedia.org/T376696#10210795 (Vgutierrez) Open→Invalid
[14:17:18] Traffic: Provide debian packages for liberica - https://phabricator.wikimedia.org/T376600#10210931 (Vgutierrez)
[14:53:56] hello traffic friends - any objections if I configure a new LVS service [0] at some point in the next couple of hours, and if no objections, any preference on time?
[14:53:56] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072796
[14:54:42] swfrench-wmf: all good from our end
[14:54:44] thanks for checking
[14:54:57] and let us know if you need us to do the pybal restarts and all (I know you will but we are here if you want us to :)
[14:57:33] sukhe: awesome, thanks! I'll probably aim for the 16:00 or 17:00 UTC hour (the former if there's no conflict with the puppet request window).
[14:57:33] also thanks for offering re: pybal restarts - I'm happy to take care of that, but if anything seems out of the ordinary, I'll flag it (and I'll SAL as I go)
[14:57:51] thanks!
[15:49:54] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10211237 (Papaul)
[15:50:08] netops, Infrastructure-Foundations, SRE: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10211239 (Papaul)
[15:59:03] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10211299 (ayounsi) About phase 1. I checked the pfw1 config and steps here. Gave some feedback over IRC. Overall lgtm. I didn't check pha...
[16:03:20] sukhe: looks like no takers on the puppet window, so I'll start work on this now
[16:03:52] swfrench-wmf: ok!
[16:05:58] hmmm ... looks like "Check if Pybal has been restarted after pybal.conf was changed" is already firing, albeit only in eqiad
[16:06:53] swfrench-wmf: that's OK, it won't alert right now (to CRIT)
[16:07:40] oh, I just mean I'm wondering if there are latent diffs that I'm going to be applying with my restart
[16:09:16] ah
[16:09:22] which lvs host was that?
[16:09:25] we can check on puppetboard
[16:10:12] lvs1019 and 1020
[16:10:24] gt 4 trigger duration happened ~ 1h ago
[16:10:45] so that would put us at something happening around 10 UTC
[16:11:49] swfrench-wmf: came out of a meeting now
[16:11:50] looking
[16:13:07] ah, thanks, and take your time - I'm not in a rush or anything :)
[16:14:14] I wish I could show a diff in here somehow so that we don't have to go through puppetboard
[16:14:35] so, I'm looking at journalctl for the puppet agent timer on lvs1020, and I can't see any diffs that would reasonably cause this (it's all ssh known hosts updates today)
[16:15:04] yeah so it probably was firing before and we don't have that in puppetboard
[16:15:21] so some change was made and that was never reflected with a restart
[16:15:50] ah, got it
[16:16:11] per journal output, the last pybal.conf change was on 10/2
[16:17:05] https://phabricator.wikimedia.org/P69500
[16:17:27] yepp
[16:18:18] so yeah, I guess restarts are in order
[16:18:35] not fun -- the entire purpose of these alerts was to not be in a state where a change was applied but we didn't restart pybal
[16:18:47] so it probably alerted and we never saw it (which is weird?)
[16:19:00] yup, and it looks like there's a failing check for this specific service, which I suspect the restart will address
[16:19:10] (changes the monitor config)
[16:19:13] yeah. I wish I could show an actual diff in the script
[16:19:42] ok I am doing the restarts and then we can take it from there
[16:19:47] so that at least your end is clear
[16:19:52] and then I will see how to improve this check
[16:20:00] usually what happens is that when it alerts: we check and then restart
[16:20:04] in this case it was missed
[16:20:08] and only Puppetboard has the diff
[16:20:27] oh, cool - thanks! I was going to offer to batch them up, but +1 to clearing them first
[16:20:40] yeah probably better that way
[16:20:45] no surprises
[16:20:52] :)
[16:21:08] https://www.youtube.com/watch?v=u5CVsCnxyXg
[16:21:15] * sukhe can't resist
[16:21:27] lol
[16:26:50] swfrench-wmf: thanks for bringing this up
[16:26:56] the alerts should now be cleared
[16:27:08] I will figure out a way to make the alert more verbose or at least, alert better
[16:27:19] no worries, and thanks so much for handling it!
[16:28:37] sukhe: I'll get started shortly, but one question: I see you used the restart cookbook - is that now recommended over the "equivalent" cumin commands in the docs?
[16:28:37] happy to use either - just wanted to confirm :)
[16:31:25] holding for now, as puppetserver1001 is being taken down for maintenance
[16:34:30] swfrench-wmf: you can use cumin command as long as you log the restart
[16:34:39] no strong preferences from us
[16:35:07] sounds good, and thanks - I'll let y'all know when I'm starting again
[16:35:08] as you might remember, we need to ACK some alerts for a successful cookbook run
[16:36:12] ah, right - I recall this from the service turndown case (since the ipvsadm check will immediately fail), but I didn't consider that might be the case for a turnup
[16:47:26] yeah should be OK for the turn-up so feel free to use the cookbook
[16:47:33] if not, manual restart and a !log is also OK
[16:47:49] cookbook is basically
[16:47:51] sudo cookbook sre.loadbalancer.restart-pybal --reason "picking up pybal changes" --alias lvs-secondary-eqiad restart_daemons
[17:01:28] sukhe: great, thank you!
[17:01:44] moving forward now that puppetserver maintenance is done
[17:01:49] gl!
[17:28:05] seems like the cookbook is serving you well swfrench-wmf! nice.
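Note on the 16:05-16:20 exchange above: the alert in question hinges on one condition, namely that pybal.conf changed on disk after the running pybal last started. A minimal sketch of that comparison in Python follows. It is not the actual check script; the function names and every timestamp are hypothetical, and only the idea (config change time versus daemon start time) is taken from the discussion.

```python
from datetime import datetime, timedelta, timezone


def restart_pending(config_changed_at: datetime, daemon_started_at: datetime) -> bool:
    """Return True when the config was modified after the daemon last started."""
    return config_changed_at > daemon_started_at


def pending_duration(
    config_changed_at: datetime, daemon_started_at: datetime, now: datetime
) -> timedelta:
    """How long an unapplied config change has been waiting for a restart."""
    if not restart_pending(config_changed_at, daemon_started_at):
        return timedelta(0)
    return now - config_changed_at


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    started = datetime(2000, 1, 1, 8, 0, tzinfo=timezone.utc)
    changed = datetime(2000, 1, 2, 9, 30, tzinfo=timezone.utc)
    now = datetime(2000, 1, 6, 9, 30, tzinfo=timezone.utc)
    if restart_pending(changed, started):
        print(f"restart pending for {pending_duration(changed, started, now)}")
```

Reporting the duration alongside the boolean is one way such a check could be made more verbose, which is the improvement wished for in the chat; showing the actual config diff would need more than timestamps.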
[17:28:24] yeah, it's quite nice :)
[17:29:08] the only maybe-awkward part is waiting for the etcd connection check to resolve, but I'm kind of liking it since it's subtle forcing function to go slow
[17:29:31] (only slow because it's on a 5m check interval, that is)
[17:29:42] anyway, moving on to codfw
[17:29:46] :)
[17:33:02] swfrench-wmf: there are ways to force a re-check in icinga :D
[17:34:22] cdanis: indeed there are, I'm just kind of leaving it be to force slowness :)
[17:36:43] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737 (ssingh) NEW
[17:37:12] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10211781 (ssingh) p:Triage→High
[17:46:33] all done - thanks, sukhe!
[17:46:40] <3 you did the work
[18:04:34] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10211918 (wiki_willy) a:RobH
[18:22:12] FYI, going to make some updates to [0] - (1) switch to production is a noop for LVS, (2) discovery DYNA records patch should *not* be merged yet (puppet patch should be merged, and puppet needs to run on dnsboxen in to create the DNS resources before moving on)
[18:22:12] https://wikitech.wikimedia.org/wiki/LVS#Add_discovery/DNS_resources
[18:22:48] sounds good
[18:22:51] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10211985 (RobH) I chatted with @ssingh about this via IRC: The directions will be to pull the 8 of 9 misc hosts and 8 cp hosts out of the racks. These...
[18:27:34] sukhe: thanks! should I +2 and puppet-merge? :D
[18:27:52] ahh I missed the nits
[18:27:58] cdanis: feel free to. they are just nits
[18:28:12] but merging directly will clear the caches as expected so you might want to disable puppet on A:dnsbox
[18:28:19] ack
[18:28:25] and then roll out slowly with -b1 -s120 or something (that's what I do at least)
[18:28:44] shouldn't the usual puppet splay be good enough?
[18:29:07] I guess you're saying I should be sure :)
[18:29:08] I think this is one of the "better be in control" things for me
[18:29:11] ok!
[18:29:21] thanks for the patch!
[18:30:34] (on the doh* hosts, we don't do automatic pdns-rec restarts but on DNS hosts, puppet does it for us)
[18:30:40] ahh ack
[18:30:45] I'll definitely do the whitespace change then
[18:32:57] and yeah I was torn about doing the minimal change from the defaults or not
[18:33:04] yeah same here
[18:33:09] I don't think we need it but I see why you did it
[18:33:23] the pdns-rec ACL is good enough and that's what we depend on for other stuff
[18:33:36] if you think the /16 are too restrictive you should feel free to do a larger /8
[18:34:13] yeah for now I'm not going to worry about it, but if we ever have to expand it, might as well
[18:34:34] fair
[18:34:44] if we want to bikeshed this more, I can always ping bblack
[18:34:46] * sukhe hides
[18:34:55] hahaha
[18:45:23] cdanis: is your change one that should *not* be concurrent with an authdns-update? if so, I can hold on pursuing https://gerrit.wikimedia.org/r/c/operations/dns/+/1072794 :)
[18:45:51] swfrench-wmf: should not affect it
[18:47:00] sukhe: great, thank you! I wasn't quite sure what "it" was in this context, so wanted to make sure.
[18:47:15] aaand thank you for the review :)
[18:47:25] yeah, my change just touches the powerdns recursor config
[18:48:11] cdanis: ah, got it - thank you
[18:53:23] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10212100 (ssingh) Thanks for writing it down @RobH. 1. Ganeti hosts: I think we can simply point to another installserver if this means doing this in o...
[19:09:21] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212159 (Papaul) @Jgreen @Dwisehaupt when do you think you will have time to relocate the 4 servers in the table that have "YES" on the...
[20:49:57] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212539 (Papaul)
[21:08:19] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212583 (Papaul)
[21:17:02] win 6
[21:36:27] Traffic, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10212679 (CDanis)
[21:39:12] Traffic, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171#10212682 (CDanis) ===== Works in prod now: {P69502} == Remaining work to do: [ ] {T376291} [ ] {T376762}
[21:42:28] Traffic, Data-Platform, Data Products (Data Products Sprint 20 🎯), Patch-For-Review: NEW BUG REPORT - Issues in calculation logic for unique devices tables - https://phabricator.wikimedia.org/T375527#10212699 (Mayakp.wiki) @odimitrijevic: here is the [[ https://docs.google.com/document/d/1dECNZRR...
[22:37:41] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212771 (Papaul) @Jhancock.wm we are going to put civi2001 on the new switch on port 7 since on U6 we have a 2U server so we will just be...
[22:38:38] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212778 (10Papaul)
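Note on the 18:28 exchange above: the suggestion there is to disable puppet on the DNS hosts and then roll the change out slowly "with -b1 -s120 or something". Reading that as batches of one host with a 120 second pause between batches (an assumption about what the flags mean, not something the log states), the pattern can be sketched in Python as below. The host names, the `action` callback and the function names are all hypothetical; this is not the tooling used in the chat.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `hosts`, each at most `size` long."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_run(
    hosts: Sequence[str],
    action: Callable[[str], None],
    batch_size: int = 1,
    sleep_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply `action` to each host batch by batch, pausing between batches."""
    done: List[str] = []
    chunks = list(batches(hosts, batch_size))
    for index, chunk in enumerate(chunks):
        for host in chunk:
            action(host)
            done.append(host)
        # No pause is needed after the final batch.
        if index < len(chunks) - 1:
            sleep(sleep_seconds)
    return done


if __name__ == "__main__":
    # Hypothetical host names; the pause is shortened so the demo finishes quickly.
    rolling_run(["dns-a", "dns-b", "dns-c"], print, batch_size=1, sleep_seconds=0.1)
```

The point made in the chat is control: with one host per batch, a bad change shows up on a single host and the rollout can be stopped before the next batch starts, instead of relying on the usual puppet splay.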