[00:34:23] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11802632 (DLynch) Definitely still happening -- just saw it on https://integration.wikimedia.org/ci/job/quibbl...
[06:55:13] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11802993 (ArthurTaylor) Same - https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php82/36243/console
[08:00:21] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11803098 (ayounsi) That could help slightly for the few hosts that are in the matching row (allow to move them to a rack without re-numbering). If we look at eqiad and its 18 hosts, that...
[08:04:32] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Data-Platform-SRE (2026-03-27 - 2026-04-17): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11803118 (JAllemandou)
[08:07:13] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11803124 (ayounsi) #dc-ops (@Papaul, @Jhancock.wm, @VRiley-WMF , @Jclark-ctr, @RobH). Which rack, from each "pods" (see task description) could we use to have an additional "public" vla...
[08:20:05] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11803153 (elukey) @BCornwall is there a specific downtime that you have in mind for the LVS servers? So we can have more context.. As Riccardo menti...
[08:29:20] FIRING: [2x] DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns5003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqsin&var-instance=dns5003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[08:31:25] FIRING: [2x] SystemdUnitFailed: gdnsd.service on dns5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:25] RESOLVED: [2x] SystemdUnitFailed: gdnsd.service on dns5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:20] RESOLVED: [2x] DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns5003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqsin&var-instance=dns5003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[09:01:24] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11803365 (ABran-WMF) Thanks @DLynch @ArthurTaylor for reporting these builds. The error rate seems steady: `...
[09:09:36] Traffic, Patch-For-Review: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402#11803392 (Fabfur)
[09:23:19] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11803455 (cmooney)
[09:27:53] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11803469 (ABran-WMF) A first manual test on gerrit-spare shows no breakage, I will now try to apply that chang...
[09:52:31] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11803552 (ABran-WMF) Envoy now stops reusing connections to httpd on our Gerrit primary instance. I'll monitor...
[10:35:16] Traffic, Patch-For-Review: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402#11803721 (Fabfur)
[10:48:47] Traffic, Patch-For-Review: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402#11803775 (Fabfur) Open→Resolved
[12:21:35] netops, Infrastructure-Foundations, SRE: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816 (ayounsi) NEW
[13:28:57] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11804390 (ABran-WMF) No impact from that change over the past few hours: ` root@contint1002:/srv/jenkins/build...
[14:02:12] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11804537 (Jhancock.wm) imho, i'd prefer a rack not in A row cause of the two CR racks already taking up real estate. D row has no specialty rack at all so we can easily work around that...
[15:00:04] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11804934 (Jhancock.wm) also papaul is on vacation and i'd like to have his weight in as well
[16:09:45] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11805507 (BCornwall) Thanks for the response, @elukey! Indeed, Icinga would ideally not even be used any more. Since the service in question is plan...
[16:25:32] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 4 others: Increase visibility of kubernetes network status - https://phabricator.wikimedia.org/T356877#11805591 (MLechvien-WMF) a:Blake
[16:56:09] I'm trying to use reprepro and getting an error related to haproxy pkgs: https://phabricator.wikimedia.org/P90343 is it safe to delete these?
[17:31:18] inflatador: Those should all be safe to delete - there are still 2.8 versions in use with bullseye and that's not listed
[17:58:11] brett thanks, will delete once I get back from lunch
[18:00:05] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11805976 (RobH) Dell has confirmed case update and will dispatch a new mainboard and cpu bracket. Once they do, they'll email/update with tracking and then dispatch will reach back out to schedule...
[19:16:20] Traffic, SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806246 (Reedy)
[19:27:29] Traffic, SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806289 (Aklapper) > Two API keys under the same username (one for dev, one for prod) Hi, which exact type of "API keys" is this about? > and https://upload.wikimedia.org URIs for image downloads How exactly do you...
[19:32:05] Traffic, SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806321 (ssingh) In addition to what @Aklapper has mentioned above, please also include the full error message (that you see along with the 429). Thanks.
[19:58:15] Traffic, SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806393 (NakavoDev) Hey, > Hi, which exact type of "API keys" is this about? We've created the api keys in the app management portal {F75436999} > How exactly do you construct such image URIs? We do this in 3 steps...
[22:30:35] Traffic, SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806792 (Reedy) By you extract one of the links... What do you mean? Are you always getting thumbs? Or are you sometimes (often?) requesting the originals based on size?