[01:43:58] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2043 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2043 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[03:49:05] Traffic, Patch-For-Review: Refresh trafficserver_backend_requests_seconds histogram - https://phabricator.wikimedia.org/T411584#11649324 (RKemper) +1 to this — we're frequently running into the 1.2s ceiling during recent periods of WDQS instability. ex: https://grafana.wikimedia.org/d/000000489/wikidata...
[04:35:14] Traffic: Images randomly fail to load - https://phabricator.wikimedia.org/T418323#11649382 (Bugreporter)
[05:43:58] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2043 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2043 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[08:35:46] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11649709 (MatthewVernon) >>! In T414805#11642586, @ShakespeareFan00 wrote: > This change to standardised size has also broken the "...
[08:49:18] <_joe_> fabfur: PTAL to cp2043
[08:49:58] it's the new trixie host, not pooled, probably downtime expired
[08:51:05] extended silence
[10:21:25] FIRING: SystemdUnitFailed: haproxy.service on cp2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:48] FIRING: PuppetFailure: Puppet has failed on cp2045:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:30:49] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11649996 (ayounsi) For asw1-23-ulsfo gNMI/TLS issue I've opened Nokia support case 05482268. --- ` We're currently provisioning two new switches. The first...
[10:37:44] Traffic: Images randomly fail to load - https://phabricator.wikimedia.org/T418323#11650033 (Aklapper) Open→Stalled Hi @BrokenImages1234, thanks for taking the time to report this! Unfortunately this Wikimedia Phabricator task lacks some information. If you have time and can still reproduce the situat...
[11:38:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7009 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7009 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[11:39:04] ^^ me
[11:43:14] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox: 429 too many requests when trying to view to .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11650283 (AlexisJazz)
[11:44:56] Traffic, Wikimedia-Site-requests, Logos, Patch-For-Review: logos/manage.py failing due to 429 (thumbnail steps) - https://phabricator.wikimedia.org/T414048#11650292 (AlexisJazz) Did this also cause T418346?
[11:48:49] Traffic, Wikimedia-Site-requests, Logos, Patch-For-Review: logos/manage.py failing due to 429 (thumbnail steps) - https://phabricator.wikimedia.org/T414048#11650299 (taavi) >>! In T414048#11650286, @AlexisJazz wrote: > Did this also cause T418346? (is Firefox a "non-browser" now?) The logo manag...
[11:49:54] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox: 429 too many requests when trying to view to .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11650302 (AlexisJazz)
[11:52:21] Traffic, Wikimedia-Site-requests, Logos, Patch-For-Review: logos/manage.py failing due to 429 (thumbnail steps) - https://phabricator.wikimedia.org/T414048#11650306 (AlexisJazz) >>! In T414048#11650299, @taavi wrote: >>>! In T414048#11650286, @AlexisJazz wrote: >> Did this also cause T418346? (is...
[11:53:16] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox: 429 too many requests when trying to view to .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11650308 (AlexisJazz)
[11:58:44] Traffic, Commons: HTTP 429 error on original image requests on Commons (iOS app by default hiding the Referrer header) - https://phabricator.wikimedia.org/T413570#11650315 (AlexisJazz) Maybe T414048 and/or T418346 are related somehow? There seems to be an overarching theme of browser-specific 429 errors...
[12:08:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2045 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2045 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[12:38:43] RESOLVED: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7009 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7009 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[12:53:35] Traffic, Commons: HTTP 429 error on original image requests on Commons (iOS app by default hiding the Referrer header) - https://phabricator.wikimedia.org/T413570#11650517 (Nylki) >>! In T413570#11650315, @AlexisJazz wrote: > Maybe T414048 and/or T418346 are related somehow? There seems to be an overarch...
[13:25:59] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox: 429 too many requests when trying to view .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11650627 (AlexisJazz)
[13:32:48] bblack: vgutierrez: have you looked at DAMON at all? TIL but it seems like it could be really interesting for understanding cache host RAM access patterns
[13:33:25] cdanis: Damon as in https://damonitor.github.io/posts/damon/?
[13:33:28] yeah
[13:33:38] I stumbled across https://docs.kernel.org/admin-guide/mm/damon/start.html
[13:45:01] I'll take a look, thanks for the ping
[13:45:58] probably gets easier once they're on trixie 😅
[13:46:12] that's a work in progress already :D
[13:46:16] ye :)
[13:46:42] oh, damo is packaged in trixie (but not earlier)
[14:10:49] Traffic, SRE: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11650785 (Aklapper) What is the exact and full error message? What is the exact User Agent string?
[14:21:43] FIRING: [46x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:23:08] Traffic, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, MW-Interfaces-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424#11650826 (HCoplin-WMF) @daniel & @MSantos -- Is this still a concern? or...
[14:26:24] Traffic: Reimage cp20[43-58] to Trixie - https://phabricator.wikimedia.org/T418161#11650830 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1003 for host cp2045.codfw.wmnet with OS trixie completed: - cp2045 (**PASS**) - Removed from Puppet and PuppetDB if present and d...
[14:26:43] FIRING: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:36:43] RESOLVED: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:34:47] I'm looking at configuring orchestrator behind the CDN for https://phabricator.wikimedia.org/T317179 but the current orchestrator.wikimedia.org FQDN is configured using a CNAME and the migration could be a bit tricky. Can I get some help with this please?
[15:53:08] Traffic, SRE: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11651280 (derenrich) here is the request in full > > method: GET > uri: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Emanu-elNYjeh.JPG/250px-Emanu-elNYjeh.JPG > compressionSta...
[15:53:13] federico3: we can help with that
[15:53:42] federico3: give us some time to go through it please and we should set up a call?
[15:54:11] sukhe: sure, what data can I provide you in the meantime?
[15:55:49] federico3: I guess the immediate question is, are there specific reasons you want this to be behind the CDN other than moving it to a private IP?
[15:59:55] sukhe: it's been running on a VM on a public ipaddr directly exposed as orchestrator.w.o as we need to access it with browsers for the data pers. team. Now we have a new VM with a private ipaddr but still need browser access. (If there are alternatives I'm all ears/eyes)
[16:03:17] yeah that's one reason for sure. but there are some other considerations here that we can discuss.
[16:03:39] that is, if there are any issues for the CDN if we put something behind it. I don't think so in this case but we need to think about it
[16:03:41] arnaudb: after thinking a little bit more about it, you don't need an additional LVS service on high-traffic1 for gerrit-replica
[16:03:52] federico3: give us some time and we will follow up from Traffic? how does that sound?
[16:05:19] sukhe: sure! (some details: the amount of traffic is going to be minimal and there's no need for caching)
[16:05:43] yeah makes sense
[16:08:58] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2045 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2045 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[16:10:13] can we silence those for good? :)
[16:11:08] but how?
[16:12:03] Traffic, cloud-services-team, Data-Services, Datasets-General-or-Unknown, Patch-For-Review: Move dumps.wikimedia.org HTTP service behind CDN edge - https://phabricator.wikimedia.org/T306550#11651360 (taavi) >>! In T306550#11624133, @BCornwall wrote: > It would be helpful to state the desired...
[16:12:22] yeah good question. I think we can add an exception to the alerting for the hosts that we are working on, an alert specifically for this on alertmanager for the hosts in question, or we can just downtime the hosts in their entirety
[16:15:01] actually we would need to fix the alert
[16:15:45] Indeed - the hosts are already downtimed - it's the alert that needs work
[16:18:21] it's funny cause haproxykafka_valid_messages_total{instance="cp2045:9341"} doesn't contain data at all
[16:22:32] so for cp2045 it looks like haproxykafka is toasted
[16:23:43] FIRING: [37x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[16:24:09] FIRING: LVSHighRX: Excessive RX traffic on lvs5004:9100 (ens1f0np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:24:35] wow :)
[16:28:43] FIRING: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[16:31:57] Traffic, Commons, Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392 (AlexisJazz) NEW
[16:33:33] Traffic, Commons, Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651445 (MilkyDefer) Can confirm. I am even having significant difficulty accessing Phabricator.
[16:33:44] netops, Traffic, Commons, Infrastructure-Foundations, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651446 (AlexisJazz)
[16:34:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs5004:9100 (ens1f0np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[16:37:31] netops, Traffic, Commons, Infrastructure-Foundations, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651469 (AlexisJazz)
[16:40:12] netops, Traffic, Commons, Infrastructure-Foundations, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651485 (Jdforrester-WMF) p:Triage→Unbreak!
[16:40:58] netops, Traffic, Commons, Infrastructure-Foundations, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651487 (JaydenKieran) Can confirm has been affecting en.wikipedia.org and mediawiki.org too, though seems more s...
[16:42:56] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651504 (AlexisJazz)
[16:48:43] FIRING: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[16:52:15] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651544 (Nemoralis) https://www.wikimediastatus.net/incidents/dgdcls8b0ybt
[16:53:30] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651549 (AlexisJazz) There was also a 5 minute spike in 50x errors at 14:15. Also between 15:30 and...
[16:53:43] FIRING: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[16:56:58] netops, Traffic, Infrastructure-Foundations, SRE, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651556 (AlexisJazz)
[16:58:12] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651560 (AlexisJazz)
[16:58:43] RESOLVED: [55x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[17:27:12] Traffic, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work, Patch-For-Review: haproxy: strip x-wmf-* headers from responses - https://phabricator.wikimedia.org/T417781#11651716 (Ottomata)
[17:29:28] Traffic, MW-Interfaces-Team, Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team (Radar), and 2 others: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11651727 (Ottomata)
[17:42:16] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11651815 (BTullis) @ayounsi directed me to this ticket after reading: {T418398} I believe that this is also preventing the reimaging of: * `dse-k8s-worker1026` on `lsw1-c2-eqiad` * `dse...
[19:22:15] Traffic, Commons: HTTP 429 error on original image requests on Commons (iOS app by default hiding the Referrer header) - https://phabricator.wikimedia.org/T413570#11652128 (SuperHamster) > To my knowledge there are now several rate-limits enforced now, that differentiate between client type (those with a...
[19:32:34] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11652162 (ssingh) This should now be resolved but leaving to the task author to mark this as "Resolved". We...
[20:08:58] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2045 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2045 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[20:24:25] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11652249 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp2043.codfw.wmnet with OS trixie
[20:46:19] netops, Traffic, Infrastructure-Foundations, SRE, Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11652313 (Aklapper)
[21:09:32] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11652345 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp2043.codfw.wmnet with OS trixie completed: - cp2043 (**PASS**) - Downtimed on Icinga/Alertma...
[21:11:59] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11652360 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp2044.codfw.wmnet with OS trixie
[21:57:18] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11652477 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp2044.codfw.wmnet with OS trixie completed: - cp2044 (**WARN**) - Downtimed on Icinga/Alertma...
[22:03:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2044 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2044 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages