[03:07:19] Traffic, SRE, WMF-General-or-Unknown: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11864010 (Bugreporter)
[03:45:43] FIRING: [2x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[03:50:43] FIRING: [12x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[03:55:43] FIRING: [15x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[04:00:43] FIRING: [15x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[04:05:43] RESOLVED: [15x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[04:57:22] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11864039 (Papaul)
[06:07:24] netops, Infrastructure-Foundations: POPs - free up 2x100G ports - https://phabricator.wikimedia.org/T424611 (ayounsi) NEW p:Triage→High
[06:11:47] netops, Infrastructure-Foundations: POPs - free up 2x100G ports - https://phabricator.wikimedia.org/T424611#11864154 (ayounsi)
[07:00:09] Traffic: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11864242 (Aklapper) @TmY_e12: Hi, which web browser(s) and version(s) is this about?
[07:04:08] FYI, the durum hosts in eqsin will briefly alert due to a Ganeti change as part of the migration to routed Ganeti
[07:07:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum5001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=eqsin&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:12:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum5001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=eqsin&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:15:30] FIRING: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum5001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=eqsin&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:20:30] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum5001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=eqsin&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:24:25] FYI, the hcaptchy-proxy hosts in eqsin will briefly alert due to a Ganeti change as part of the migration to routed Ganeti
[08:08:16] netops, Infrastructure-Foundations, Observability-Metrics: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360#11864411 (ayounsi) All gnmic instances have been upgraded to 0.45.0
[08:43:04] Traffic: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11864497 (TmY_e12) I get the same error across multiple browsers, including Opera GX, Edge, and Firefox, as well as on my mobile phone when using mobile data.
[08:51:41] Traffic: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11864527 (TmY_e12)
[10:13:10] netops, Infrastructure-Foundations, SRE: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639 (cmooney) NEW p:Triage→Medium
[10:15:28] Traffic: "Error 429: Too many requests" triggered for regular user - https://phabricator.wikimedia.org/T424588#11865039 (Aklapper)
[10:15:50] Traffic: A single attempt to download a file refers the user to the robot policy - https://phabricator.wikimedia.org/T424619#11865040 (Aklapper)
[10:18:06] Traffic: "Error 429: Too many requests" triggered for regular user - https://phabricator.wikimedia.org/T424588#11865042 (Fabfur) The number of this kind of errors should be lower now, as we've tuned the rule to accommodate for higher ratelimiting
[10:34:27] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640 (cmooney) NEW p:Triage→Medium
[10:34:33] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865094 (cmooney)
[10:34:35] netops, Infrastructure-Foundations, SRE: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11865095 (cmooney)
[10:45:42] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865146 (cmooney)
[10:48:13] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865157 (cmooney)
[10:48:59] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865159 (cmooney)
[10:53:54] netops, Infrastructure-Foundations, Observability-Metrics, Patch-For-Review: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360#11865201 (ayounsi) Manually tested on netflow4003 and works well. Two differences: The metrics/graph `sum by (source) (rate(gnmic_su...
[10:55:51] Traffic: "Error 429: Too many requests" triggered for regular user - https://phabricator.wikimedia.org/T424588#11865204 (Aklapper)
[10:55:54] Traffic: A single attempt to download a file refers the user to the robot policy - https://phabricator.wikimedia.org/T424619#11865207 (Aklapper) →Duplicate dup:T424588
[10:56:07] Traffic: "Error 429: Too many requests" triggered for regular user accessing an image for the first time - https://phabricator.wikimedia.org/T424588#11865209 (Aklapper)
[13:04:42] jasmine_: you did all the work :)
[13:04:59] we should look into fixing the probe healthcheck
[13:12:12] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, and 2 others: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11865824 (ABran-WMF) I did not identify any obvious pattern in the recent build failures. Envoy logs the last http res...
[13:28:48] Traffic: "Error 429: Too many requests" triggered for regular user accessing an image for the first time - https://phabricator.wikimedia.org/T424588#11865951 (CDanis) Open→Resolved a:CDanis Apologies, SRE put a rule in place as part of responding to an especially-impactful scraping incident,...
[13:30:44] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-04-24 - 2026-05-15), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11865963 (BTullis) I have finished the migration of the dse-k8s api servers to ipip. Now I will follow up wi...
[13:52:12] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11866061 (Papaul)
[15:25:06] netops, Infrastructure-Foundations: POPs - free up 2x100G ports - https://phabricator.wikimedia.org/T424611#11866503 (cmooney) My basic thoughts on this are: * We create a new vlan on each top-of-rack switch at the POPs for the "core router transport" ** suggest cr-ibgp- for it * We allocate...
[15:26:30] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11866517 (cmooney)
[15:36:45] hey, just wanted to flag https://phabricator.wikimedia.org/T419887#11811663 to the traffic team
[15:36:48] it seems like it might justifiably be low*er* priority than some of the other stuff that's probably going on, but i just wanted to flag it up as it seems like it might be potentially impacting some folks that try to visit the wikimedia status page
[15:37:24] (i was prompted to mention this in IRC just now, as I tried to visit wikimediastatus in my browser for an unrelated reason rn, and the page wasn't loading [for, i believe, that reason])
[15:45:02] A_smart_kitten: hi, thanks for flagging it. cwhite put up a patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242499 and there is more discussion there.
[15:45:11] also see T419887
[15:45:12] T419887: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887
[15:45:26] oh right, that's the one you linked
[15:46:09] we have two ways of going about it:
[15:46:35] 1. we use the http-01 challenge in acme-chief and update ncredir to use that. but we don't do that in production anywhere else and that is some work, and also there is no other use case.
[15:48:33] and well, the other option I had in mind is really not an option, so 1 it is.
[15:48:57] sukhe: ty for the info/replies :)
[15:49:16] 2 was some HTTP-level redirect at Markmonitor but now I recall we can't do that in this case
[15:50:20] to be clear i don't mean to be overly pushy, so apologies if my message here is coming across like that! i guess i am just a little worried that some folks might be attempting to visit wikimediastatus, and it might be appearing to be completely down/timing out for them rn (due to them not visiting the www. domain)
[15:50:43] netops, Infrastructure-Foundations, SRE: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11866670 (cmooney)
[15:50:44] yeah that's completely fair and I don't think you were pushy (also being pushy sometimes helps I guess, though on how it's done :)
[15:50:58] I will discuss with the team again and see if we can make the http-01 thing happen
[15:51:25] on the acme-chief side I don't suspect it is much work, the ncredir side yes. it's the priority of making something happen for a one-off like this, essentially, not discounting its importance
[15:54:10] ack
[15:55:18] netops, Infrastructure-Foundations, SRE: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11866721 (cmooney)
[16:37:32] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11867023 (Papaul)
[16:55:16] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11867079 (ssingh) >>! In T408892#11749076, @ayounsi wrote: > As a side note we will need to manually change the IPs of the routed ganeti nodes in rack 23 to...
[17:09:39] netops, Infrastructure-Foundations, SRE: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683 (cmooney) NEW p:Triage→Medium
[17:12:02] netops, Infrastructure-Foundations, SRE: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11867159 (cmooney)
[17:25:06] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686 (ssingh) NEW
[17:25:18] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11867263 (ssingh) p:Triage→Medium
[17:25:54] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11867268 (ssingh)
[18:19:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: Lumen 10G transport 442550293 disconnection - https://phabricator.wikimedia.org/T424758 (RobH) NEW
[18:26:47] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: Lumen 10G transport 442550293 disconnection - https://phabricator.wikimedia.org/T424758#11868220 (RobH) @Papaul: Please advise what the exact patch panel port https://netbox.wikimedia.org/circuits/circuits/103/ lands on before I...
[19:28:51] Traffic, MediaWiki-File-management, MediaWiki-General: Make MediaWiki comply with Wikimedia user agent policy - https://phabricator.wikimedia.org/T424768 (LucasWerkmeister) NEW
[19:36:40] Traffic, Commons, MediaWiki-File-management, MediaWiki-General, Patch-For-Review: Make MediaWiki comply with Wikimedia user agent policy - https://phabricator.wikimedia.org/T424768#11868423 (LucasWerkmeister) The above three patches are currently deployed on [FactGrid](https://database.factgr...
[21:22:45] Traffic, MediaWiki-File-management, MediaWiki-General, Patch-For-Review: Make MediaWiki comply with Wikimedia user agent policy - https://phabricator.wikimedia.org/T424768#11868753 (Peachey88)
[21:24:52] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Continuous-Integration-Config: Purge frontend cache when publish new coverage report under https://doc.wikimedia.org/cover - https://phabricator.wikimedia.org/T423951#11868760 (Dzahn) @Umherirrender The maximum time to wait i...
[22:46:44] Traffic, MediaWiki-File-management, MediaWiki-General, MW-1.43-release, and 4 others: Make MediaWiki comply with Wikimedia user agent policy - https://phabricator.wikimedia.org/T424768#11868996 (Reedy)
[22:49:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11869006 (cmooney) I had a stab at this in the above patch. Some notes on the event processors added: |Name|Event Process...