[03:00:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.867% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:05:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 2.263% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:28:02] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11957928 (Marostegui)
[05:29:21] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11957930 (Marostegui) @jcrespo FYI db2250 @FCeratto-WMF can you take care of depooling pc2021 and coordinating db2158? cc @CWilliams-WMF
[07:55:29] the moment arrived - pki-root1001 to insetup (prep for decom) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1294179
[08:17:28] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11958191 (jcrespo) Thanks for the heads up, @Marostegui db2250 needs no special handling or depooling -other than downtiming-, assuming maintenance happens during the day.
[08:40:01] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 (ayounsi) NEW p:Triage→Medium
[08:41:00] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958324 (ayounsi)
[08:41:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958339 (ayounsi)
[08:41:18] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958338 (ayounsi)
[08:43:07] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958357 (ayounsi)
[08:43:32] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958374 (ayounsi)
[08:44:59] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958376 (jcrespo) db2183 will require stopping mediabackups in advance, to prevent losing metadata. I will take care of that. For db2...
[09:45:39] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11958638 (ABran-WMF) In progress→Resolved Alerts have been merged, I'm marking this as `Resolved`, feel free to...
[10:08:09] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958744 (jijiki) @ayounsi `mc2055` and `mc-gp2004` are on A4, and that is by accident. `mc-gp2004` is working as a backup in case `mc...
[10:44:20] netops, Infrastructure-Foundations, SRE, Traffic: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11958890 (ayounsi)
[10:51:30] netops, Infrastructure-Foundations, Traffic: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11958926 (ayounsi) Open→Resolved a:ayounsi I think we can close this task as the new transport circuits will eliminate that routing loop...
[10:53:03] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11958935 (ayounsi)
[10:53:04] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11958936 (ayounsi)
[10:54:06] elukey: nice, so I can remove it from that list https://phabricator.wikimedia.org/T421706 what about krb1002 do you think it could be migrated to a per rack vlan?
[10:57:54] netops, Infrastructure-Foundations, observability, SRE: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298#11958946 (ayounsi) Open→Declined We're not going to add more stuff to Icinga.
[11:06:28] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11958978 (ayounsi)
[11:06:29] netops, Infrastructure-Foundations, SRE: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#11958977 (ayounsi)
[11:20:15] dear foundations, I have run into a case of "Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually"
[11:21:30] and I am happy to run it manually, as long as I am not going to cause any issues
[11:28:22] context is https://phabricator.wikimedia.org/T418922#11956972
[11:33:13] effie: it's safe to run it manually, do you have more logs on what the error was?
[11:40:06] netops, Infrastructure-Foundations, SRE: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936#11959179 (ayounsi) Open→Resolved a:cmooney It's now showing up thanks to {T424683} https://grafana.wikimedia.org/goto/dfnbnedrb28sg...
[11:40:55] netops, Infrastructure-Foundations, SRE, Traffic: Map video and other large files to 'low-priority' network Qos queue - https://phabricator.wikimedia.org/T410133#11959189 (cmooney) Open→Resolved a:cmooney We actually added a mechanism to do this late last year when we had some une...
[11:46:12] XioNoX: it was run by jen but let me dig a bit
[11:50:21] https://www.irccloud.com/pastebin/Bzq81VFI/
[11:51:02] weird
[11:51:19] netops, Infrastructure-Foundations, SRE, Traffic: Map internet-bound upload traffic to low-priority QoS queue - https://phabricator.wikimedia.org/T415649#11959238 (cmooney) Open→Declined I'm going to close this one. I hadn't fully thought out the way we serve things currently. `uplo...
[11:51:26] can you open a task for that? maybe some race condition
[11:51:29] sorry wrong paste
[11:54:08] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#11959248 (ayounsi)
[11:54:35] I updated it
[11:54:46] XioNoX: o/ yeah it should be possible, we failover to krb2002
[11:54:49] lemme check
[11:55:26] something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/917359
[11:59:12] effie: yeah no idea what happened, but running it now is a noop
[11:59:30] elukey: cool!
[11:59:39] XioNoX: thanks!
[11:59:53] what is the kadmin failover for?
[12:00:36] moritzm: phabricator.wikimedia.org/T421706
[12:00:50] basically re-image with a new IP
[12:01:08] would it be possible for puppetserver1001 too?
[12:02:11] Puppet server 1001 is a bit more delicate, we'll need to change where we run puppet-merge
[12:02:57] it's from 2021 so maybe we can just wait to refresh it?
[12:03:25] everything is possible, but yeah failing over puppetserver1001 is tricky, in addition to the puppet merges it also controls other central functions which need to be failed over, like the CA, various systemd timers, possibly some requestctl things
[12:04:48] excellent welcome task for https://job-boards.greenhouse.io/wikimedia/jobs/7921820?gh_src=2k6diyy41us I would say :-)
[12:05:26] hahaha
[12:06:55] yeah we need to be able to do it in my opinion
[12:07:57] could it be part of the DC switchover? move the primary roles to a codfw puppetserver?
[13:14:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 (Papaul) NEW
[14:56:07] netops, Discovery-Search, Infrastructure-Foundations, Machine-Learning-Team, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11959954 (ssingh) Depool for cp2044 looks good; please ping Traffic if you want us to take care of it.
[16:01:39] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960266 (cmooney) That looks good to me @papaul good stuff. If we use vlan IDs 512/522 I guess the plan would be to change the vlan i...
[16:35:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960461 (Papaul) @cmooney yes we will change the VLAN-id and rename the VLAN for rack 0603 during the switch migration. so it will be...
[16:36:38] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960468 (Papaul)
[22:10:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961402 (Papaul)
[22:34:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961461 (Papaul)