[02:52:35] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:02] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021945 (Papaul) @ayounsi @cmooney I was looking at moving the mgmt interface to irb.900 and I noticed on all the mr's there is the default DHCP...
[05:00:26] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021991 (Papaul)
[05:02:43] <wikibugs>	 netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#12021993 (Papaul) @BCornwall FYI we are planning on doing the cr2-eqdfw Junos upgrade next week on Wednesday June 24th at 10:00am CT. Thanks
[05:37:35] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12022005 (ayounsi) Indeed, we can remove that DHCP pool.
[06:01:11] <wikibugs>	 homer, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12022047 (ayounsi) yeah me neither, it's a tradeoff (engineering time/impact) that I think is accepta...
[06:52:35] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:35] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:47:35] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:45] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022835 (cmooney)
[10:14:01] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022852 (FCeratto-WMF) For the `db*` hosts they can be indeed depooled as described.
[10:20:02] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new, Patch-For-Review: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12022872 (cmooney) >>! In T428020#12022852, @FCeratto-WMF wrote: > For the `db*` hosts they can be indeed depooled as described.  Thanks for the confirmatio...
[10:57:04] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023045 (MoritzMuehlenhoff)
[11:00:09] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317 (SLyngshede-WMF) NEW
[11:09:54] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023096 (Marostegui) @SLyngshede-WMF can you recreate the tables easily? I can drop the db and you can recreate the tables or else I can change the DB...
[11:10:13] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023098 (Marostegui) p:Triage→Medium a:Marostegui
[11:32:09] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023200 (SLyngshede-WMF) @Marostegui the application will recreate the tables on start up. It has some Hibernate functionality built in.  It's probably...
[11:32:59] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023206 (Marostegui) ok - doing it now!
[11:34:21] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023210 (Marostegui) Open→Resolved Done ` cumin2024@db1213.eqiad.wmnet[(none)]> show create database cas_staging; +-------------+----------...
[11:38:35] <wikibugs>	 CAS-SSO, DBA, Infrastructure-Foundations, WMF-NDA: Change database encoding for CAS staging database - https://phabricator.wikimedia.org/T429317#12023221 (SLyngshede-WMF) Thank you, that fixed my issue.... I now have another, but that's not database related :-)
[12:07:05] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023355 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f96964dd-ded4-440c-9105-9a1b97d2144f) set by cmooney@cumin1003 for 1:00:00 on 5 host(s) and their servi...
[12:10:54] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1dff57c0-efff-46db-8f85-423d33775bce) set by cmooney@cumin1003 for 1:00:00 on 29 host(s) and their serv...
[12:30:04] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023473 (jcrespo)
[12:40:40] <wikibugs>	 netops, Cloud-VPS, Infrastructure-Foundations, SRE, tools-infrastructure-team: Upgrade cloudsw1-e4-eqiad - https://phabricator.wikimedia.org/T429013#12023550 (fgiunchedi) Indeed the recent rack redundancy testing has shown we are resilient to the loss of one rack, for all hosts but cloudvirts...
[12:41:13] <wikibugs>	 netops, Cloud-VPS, Infrastructure-Foundations, SRE, tools-infrastructure-team: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014#12023556 (fgiunchedi) See my update at https://phabricator.wikimedia.org/T429013#12023550 since it applies equally here
[12:51:13] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023612 (cmooney) Switch upgrade was successful, all hosts are back online and being repooled.  @MatthewVernon if you want to check thanos-be2006 please do, it's back doing norm...
[13:04:18] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12023706 (MatthewVernon) Thanos-swift cluster looks good, thanks.
[13:12:35] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:08] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023730 (cmooney)
[13:13:20] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023732 (cmooney) >>! In T426343#12016920, @Papaul wrote: > @cmooney I took a look at the steps all look good to me for...
[13:19:51] <wikibugs>	 netops, fundraising-tech-ops, Infrastructure-Foundations: Move WMF5520's switch ports to frack-eqiad-administration vlan - https://phabricator.wikimedia.org/T429340 (Jgreen) NEW
[13:45:31] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023875 (cmooney)
[13:47:35] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:49:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:35] <wikibugs>	 netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12024308 (Papaul)
[15:54:44] <wikibugs>	 homer, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12024751 (cmooney) >>! In T428886#12019220, @taavi wrote: > I'm not a huge fan of relying on the exac...
[17:07:30] <wikibugs>	 netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#12025355 (BCornwall) Thanks for the heads up, @Papaul! Is there anything you require of traffic during that time?
[17:31:46] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12025477 (cmooney) Open→Resolved All done here.
[17:32:04] <wikibugs>	 netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12025486 (cmooney)
[17:47:35] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:52:35] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:36] <wikibugs>	 netops, Infrastructure-Foundations, SRE: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386 (cmooney) NEW p:Triage→Low
[18:51:11] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Network device tls certs: alerting niggles - https://phabricator.wikimedia.org/T429242#12026032 (cmooney) So something odd is going on here.  If I query thanos only two Nokia devices in eqiad are currently showing a //probe_ssl_earliest_cert_expiry// value: `...
[18:51:42] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Blackbox probe for TLS cert expiry failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026037 (cmooney)
[19:35:06] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Blackbox probe for TLS cert expiry failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026310 (cmooney) Open→Resolved Alright, small gap when gnmic had to reconnect but other than that lsw1-c4-eqiad is back working...
[19:39:06] <XioNoX>	 topranks: https://www.reddit.com/r/Juniper/comments/1u7ngnt/mx204_end_of_sale_announcement_again_and_final/
[19:39:45] <topranks>	 #sadface 
[19:40:58] <cdanis>	 😔
[19:42:26] <topranks>	 I suspect supply-chain for that HMC memory is part of the equation still  
[19:42:51] <topranks>	 probably getting mad expensive with all the AI "fun".  and they can likely use just a little more and make an MX301 instead 
[21:47:35] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:52:35] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed