[08:06:05] netops, Infrastructure-Foundations, Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#11859386 (ayounsi) Resolved→Open There is some hope that Junos 25.4R1 comes with gNMI support - https://apps.juniper.net/feature-explorer/select-platform.html?...
[08:15:13] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, and 2 others: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11859405 (ABran-WMF) The last test with `delayed_close_timeout` did not pan out. I have enabled debug logs on several...
[08:43:09] netops, Infrastructure-Foundations: Upgrade netflow hosts to Trixie - https://phabricator.wikimedia.org/T424478 (ayounsi) NEW p:Triage→Low
[09:15:06] Traffic, Wikimedia-Hackathon-2026: Wikimedia Hackathon 2026: Wikimedia's Production DNS Infrastructure and GeoDNS User Routing - https://phabricator.wikimedia.org/T423331#11859847 (Johannes_Richter_WMDE) misclicked 😅
[09:17:36] Traffic, Wikimedia-Hackathon-2026: Wikimedia Hackathon 2026: Wikimedia's Production DNS Infrastructure and GeoDNS User Routing - https://phabricator.wikimedia.org/T423331#11859880 (Aklapper) If this is a talk, then it isn't a project but a session.
[09:19:36] Traffic, Wikimedia-Hackathon-2026: Wikimedia Hackathon 2026: Wikimedia's Production DNS Infrastructure and GeoDNS User Routing - https://phabricator.wikimedia.org/T423331#11859897 (Aklapper) a:ssingh
[10:05:25] FIRING: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:25] RESOLVED: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:25] FIRING: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:52:46] netops, Infrastructure-Foundations: Upgrade netflow hosts to Trixie - https://phabricator.wikimedia.org/T424478#11860419 (ayounsi)
[11:04:25] RESOLVED: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:25] FIRING: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:52:25] RESOLVED: SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:52:48] FIRING: PuppetFailure: Puppet has failed on doh5003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:55:25] FIRING: SystemdUnitFailed: bird.service on doh5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:59:22] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11860617 (A_smart_kitten)
[12:00:25] RESOLVED: [2x] SystemdUnitFailed: bird.service on doh5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:48] RESOLVED: PuppetFailure: Puppet has failed on doh5003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:21:31] netops, Infrastructure-Foundations: mr1-eqiad: move from OSPF to BGP - https://phabricator.wikimedia.org/T421238#11860895 (cmooney) Open→Resolved
[13:46:48] netops, Infrastructure-Foundations, Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#11861075 (cmooney) >>! In T390052#11859386, @ayounsi wrote: > Now we need to figure out if it's worth upgrading the management routers or not, as it's more recent that...
[13:53:24] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11861105 (Jclark-ctr) For Eqiad, I would choose A3, C1, and either E8 or F8. A3 is currently 1G, and C1 is pending the arrival of new switches. It was previously out for fundraising.
[14:03:13] sukhe: o/ anything specific to do for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277509 ?
[14:04:31] elukey: meeting, will respond shortly
[14:05:48] sure sure
[14:09:14] elukey: simply disabling puppet on one, testing it and going with it
[14:09:27] I can take care of it if you want but obviously feel free to yourself :>
[14:16:11] sukhe: yeah yeah it is the testing it part that I am less familiar, and the blast radius if anything fails :D
[14:17:17] yeah, testing is well basically doing an account creation on testwiki and seeing if hcaptcha loads. I will take care of it.
[14:17:27] not too worried about this but yeah
[14:17:30] Traffic, SRE: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667#11861274 (jasmine_)
[14:17:36] we have already rolled this out in other places right?
[14:25:29] yes correct, all working
[14:25:35] lemme rebase the changes to allow you to go
[14:26:10] done!
[14:26:28] thanks elukey!
[14:26:47] well thank you! Lemme know how I can help if needed
[14:27:43] the other one is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277512 but I know more purged
[14:28:27] I will leave that for fabfur, he loves purged :P
[14:35:15] :|
[14:37:45] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, and 2 others: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11861458 (ABran-WMF) I'm still seeing a 0.1% rate of CI builds failing with `GnuTLS recv error (-54)`. The debug log d...
[14:38:07] fabfur: I can take care of it don't worry, I'll just warn you beforehand
[14:38:23] ack, thanks!
[14:49:13] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11861561 (Ladsgroup)
[14:51:25] netops, bacula, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: netbox2003 backups (maybe others?) are missconfigured or failing to find the backup directory - https://phabricator.wikimedia.org/T423689#11861581 (jcrespo) Open→Resolved Resolving unless issues are see...
[14:53:10] actually fabfur, any problem if I roll out purged now?
[14:59:39] ok for me
[15:06:20] netops, Infrastructure-Foundations: Upgrade netflow hosts to Trixie - https://phabricator.wikimedia.org/T424478#11861669 (ayounsi)
[15:08:38] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11861674 (ayounsi)
[15:08:43] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11861676 (ayounsi)
[15:09:22] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11861677 (ayounsi) Great, thanks ! Task description updated.
[15:14:10] Traffic, Patch-For-Review: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11861686 (Ladsgroup) Open→Resolved a:Ladsgroup Rate limit has been in place for many months now, it has gotten stricter and it's basically now 400 for all non-standar...
[15:19:01] fabfur: I applied the change to cp1115 and it looks good (grafana, journalctl, etc..)
[15:19:16] if you are ok I'll re-enable puppet and let it do its work
[15:19:28] elukey: he stepped out for a bit, so checking
[15:19:52] okokI To keep archives happy, I checked https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=now-3h&to=now&timezone=utc&var-site=eqiad&var-cluster=cache_text&var-instance=cp1115
[15:20:15] yep that looks correct
[15:22:14] elukey: checking again, will confirm
[15:22:47] ah snap sorry I thought that was a confirmation
[15:22:49] elukey: looks good, let it roll :)
[15:23:02] yeah my bad, I was commenting on the dashboard you were checking. but all good
[15:23:03] all right re-enabled puppet so it will do its job :)
[15:23:20] :)
[15:23:27] since we are here, should we do hcaptcha as well?
[15:23:34] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11861745 (ayounsi)
[15:24:06] elukey: I will take care of that one
[15:24:23] there was a deploy around hcaptcha earlier and I was waiting on purpose just to not mix the two things
[15:25:39] ah okok so lemme proceed with others in the meantime :)
[15:25:55] yep thanks
[15:36:32] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Continuous-Integration-Config: Purge frontend cache when publish new coverage report under https://doc.wikimedia.org/cover - https://phabricator.wikimedia.org/T423951#11861817 (LSobanski) a:Dzahn
[15:36:34] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Continuous-Integration-Config: Purge frontend cache when publish new coverage report under https://doc.wikimedia.org/cover - https://phabricator.wikimedia.org/T423951#11861822 (LSobanski) p:Triage→Medium
[16:13:13] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Continuous-Integration-Config: Purge frontend cache when publish new coverage report under https://doc.wikimedia.org/cover - https://phabricator.wikimedia.org/T423951#11862023 (Dzahn) The apache config template that was edit...
[18:50:22] hi traffic o/ just merged [0] is now a good time for a pybal restart?
[18:50:22] [0] - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277684
[18:53:39] yes please
[18:54:18] please let us know if you see any weird stuff :)
[18:54:58] thank you, will do o>
[19:06:30] Traffic, Continuous-Integration-Infrastructure, SRE, Patch-For-Review: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#11862758 (Dzahn) Further lowered caching from 1 hour to 10 minutes as a reaction to T423951.
[19:07:02] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Continuous-Integration-Config, Patch-For-Review: Purge frontend cache when publish new coverage report under https://doc.wikimedia.org/cover - https://phabricator.wikimedia.org/T423951#11862760 (Dzahn) Lowered CDN caching...
[21:20:24] sophroid behind lvs now - thanks sukhe!