[02:52:35] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021945 (Papaul) @ayounsi @cmooney i was looking at moving the mgmt interface to irb.900 and I noticed on all the mr's there is the default DHCP...
[05:00:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021991 (Papaul)
[05:02:43] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#12021993 (Papaul) @BCornwall FYI we are planning on doing the cr2-eqdfw Junos upgrade next week on Wednesday June 24th at 10:00am CT. Thanks
[05:37:35] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12022005 (ayounsi) Indeed, we can remove that DHCP pool.
[06:01:11] homer, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12022047 (ayounsi) yeah me neither, it's a tradeoff (engineering time/impact) that I think is accepta...
[06:52:35] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:47:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:45] netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022835 (cmooney)
[10:14:01] netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022852 (FCeratto-WMF) For the `db*` hosts they can be indeed depooled as described.
[10:20:02] netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022872 (cmooney) >>! In T428020#12022852, @FCeratto-WMF wrote: > For the `db*` hosts they can be indeed depooled as described. Thanks for the confirmatio...
[10:57:04] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023045 (MoritzMuehlenhoff)
[11:00:09] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317 (SLyngshede-WMF) NEW
[11:09:54] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023096 (Marostegui) @SLyngshede-WMF can you recreate the tables easily? I can drop the db and you can recreate the tables or else I can change the DB...
[11:10:13] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023098 (Marostegui) p:Triage→Medium a:Marostegui
[11:32:09] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023200 (SLyngshede-WMF) @Marostegui the application will recreate the tables on start up. It has some Hibernate functionality built in. It's probably...
[11:32:59] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023206 (Marostegui) ok - doing it now!
[11:34:21] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023210 (Marostegui) Open→Resolved Done ` cumin2024@db1213.eqiad.wmnet[(none)]> show create database cas_staging; +-------------+----------...
[11:38:35] CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023221 (SLyngshede-WMF) Thank you, that fixed my issue....
I now have another, but that's not database related :-)
[12:07:05] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023355 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f96964dd-ded4-440c-9105-9a1b97d2144f) set by cmooney@cumin1003 for 1:00:00 on 5 host(s) and their servi...
[12:10:54] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1dff57c0-efff-46db-8f85-423d33775bce) set by cmooney@cumin1003 for 1:00:00 on 29 host(s) and their serv...
[12:30:04] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023473 (jcrespo)
[12:40:40] netops, Cloud-VPS, Infrastructure-Foundations, SRE, tools-infrastructure-team: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12023550 (fgiunchedi) Indeed the recent rack redundancy testing has shown we are resilient to the loss of one rack, for all hosts but cloudvirts...
[12:41:13] netops, Cloud-VPS, Infrastructure-Foundations, SRE, tools-infrastructure-team: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014#12023556 (fgiunchedi) See my update at https://phabricator.wikimedia.org/T429013#12023550 since it applies equally here
[12:51:13] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023612 (cmooney) Switch upgrade was successful, all hosts are back online and being repooled. @MatthewVernon if you want to check thanos-be2006 please do, it's back doing norm...
[13:04:18] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023706 (MatthewVernon) Thanos-swift cluster looks good, thanks.
[13:12:35] FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:08] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023730 (cmooney)
[13:13:20] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023732 (cmooney) >>! In T426343#12016920, @Papaul wrote: > @cmooney I took a look at the steps all look good to me for...
[13:19:51] netops, fundraising-tech-ops, Infrastructure-Foundations: Move WMF5520's switch ports to frack-eqiad-administration vlan - https://phabricator.wikimedia.org/T429340 (Jgreen) NEW
[13:45:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023875 (cmooney)
[13:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:49:39] FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:35] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12024308 (Papaul)
[15:54:44] homer, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12024751 (cmooney) >>! In T428886#12019220, @taavi wrote: > I'm not a huge fan of relying on the exac...
[17:07:30] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#12025355 (BCornwall) Thanks for the heads up, @Papaul! Is there anything you require of traffic during that time?
[17:31:46] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12025477 (cmooney) Open→Resolved All done here.
[17:32:04] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12025486 (cmooney)
[17:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:36] netops, Infrastructure-Foundations, SRE: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386 (cmooney) NEW p:Triage→Low
[18:51:11] netops, Infrastructure-Foundations, SRE: Network device tls certs: alerting niggles - https://phabricator.wikimedia.org/T429242#12026032 (cmooney) So something odd is going on here. If I query thanos only two Nokia devices in eqiad are currently showing a //probe_ssl_earliest_cert_expiry// value: `...
[18:51:42] netops, Infrastructure-Foundations, SRE: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026037 (cmooney)
[19:35:06] netops, Infrastructure-Foundations, SRE: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026310 (cmooney) Open→Resolved Alight small gap when gnmic had to reconnect but other than that lsw1-c4-eqiad is back working...
[19:39:06] topranks: https://www.reddit.com/r/Juniper/comments/1u7ngnt/mx204_end_of_sale_announcement_again_and_final/
[19:39:45] #sadface
[19:40:58] 😔
[19:42:26] I suspect supply-chain for that HMC memory is part of the equation still
[19:42:51] probably getting mad expensive with all the AI "fun". and they can likely use just a little more and make an MX301 instead
[21:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed