[06:29:09] FIRING: LVSHighCPU: The host lvs1016:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1016 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU [06:34:09] RESOLVED: LVSHighCPU: The host lvs1016:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1016 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU [07:46:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp6010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:25] RESOLVED: [4x] SystemdUnitFailed: haproxy.service on cp6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:40] 06Traffic, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 13Patch-For-Review: Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11943911 (10JAllemandou) Summarizing a [[ https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1779213929117539 | slack thread ]] from @mforns he... [10:12:33] 06Traffic, 10ContentTranslation, 06LPL Hypothesis, 06Security-Team, and 5 others: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323#11944140 (10Clement_Goubert) a:03Clement_Goubert I feel pretty confident this is an issue with either ATS... 
[10:13:03] 06Traffic, 10ContentTranslation, 06LPL Hypothesis, 06Security-Team, and 6 others: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323#11944144 (10Clement_Goubert) [10:16:20] claime: consider slyngs is rebooting some cp hosts now, a good idea would be to pause one task or the other [10:17:51] Sorry didn't notice claime was working on something. We're currently rebooting drmrs hosts [10:18:27] np he still haven't started disabling puppet on A:cp [10:19:40] not yet yeah [10:19:48] ok I'm hitting drmrs, so I'll wait until you're done, np [10:34:01] slyngs: ping me when you're done with drmrs? [10:34:53] Sure, it'll take a while, reboots are a bit slow [10:36:15] ack [10:51:04] 06Traffic, 10Liberica, 10Prod-Kubernetes, 07Kubernetes, 06ServiceOps new (Next quarter): Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436#11944235 (10MLechvien-WMF) [12:13:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp6007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:25] RESOLVED: [4x] SystemdUnitFailed: haproxy.service on cp6007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3066:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:37] claime: fabfur: hi. it seems like you two have already synced up on the question of the cx dashboards [13:06:53] anything else to be aware of / can I help in getting more input if required? [13:20:10] 06Traffic, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep - https://phabricator.wikimedia.org/T426822#11944850 (10ssingh) 05Open→03Resolved Should now be resolved; @bd808 already cherry-picked but this has... [13:20:19] I don't think, claime you'll do the tests as soon as slyngs ends reboots ok ? [13:21:26] The final two drmrs hosts are rebooting now. The last one should be done within the hour [13:28:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp6015:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown [13:28:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp6015 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=drmrs&var-instance=cp6015 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown [13:29:43] slyngs: what happened to cp6015? downtiming issues or? [13:37:33] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11944981 (10ssingh) >>! In T414411#11914980, @RobH wrote: > Scheduled a new site visit for them to go out this Friday @ 8AM Singapore Time so my Thursday @ 4PM. > > 1-260037210462 Hi @RobH: Was this... 
[14:05:33] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945112 (10RobH) Apologies, this ran super late and I neglected to update the task accordingly. The mainboard swap was successful but it appears of the two CPUs, one of them has failed. Dell SG is... [14:08:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp6015 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=drmrs&var-instance=cp6015 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown [14:09:41] yeah cp6015 never came back up it seems [14:09:48] let's try again [14:10:25] though the cookbook is still processing it so I guess we should just wait for that to time out [14:10:36] where's the cookbook running? [14:10:45] slyng.s is running it [14:16:05] sukhe: fabfur yeah testing the revert on two cp nodes once slyng.s is done is the plan [14:16:53] claime: ok, thanks [14:18:56] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945156 (10ssingh) >>! In T414411#11945112, @RobH wrote: > Apologies, this ran super late and I neglected to update the task accordingly. > > The mainboard swap was successful but it appears of the... [14:31:02] 06Traffic, 06Data-Persistence, 10MediaViewer, 10SRE-swift-storage, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11945184 (10MatthewVernon) 05Open→03Resolved This change has been implemented in puppet, so this task can be closed. [14:33:23] Cookbook timed out. I'm skipping cp6016 for today [14:35:57] slyngs: cp6016 is the one I'm hitting, is it up? 
[14:36:27] (or was hitting when I did my tests earlier anyhow) [14:38:30] yeah, probably you are now hitting another host since cp6016 should be depooled [14:51:30] x-cache [14:51:32] cp6012 miss, cp6016 hit/20 [14:51:34] apparently not [14:53:54] Ok then I'm gonna merge my revert, which means disabling puppet on all cp except cp6012 and cp6016 [14:55:22] ack [15:05:12] yeah problem's still there, so it's a caching difference introduced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1245389 [15:05:46] fabfur: can you +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1290810 so I can put everything back [15:05:57] Then I can remove the puppet lock, and start digging into the plugin stack [15:06:27] Wait I still got a hit [15:07:14] hit? [15:07:23] ah ATS doesn't purge cache based on what backend responds [15:07:33] x-cache [15:07:35] cp6012 miss, cp6016 hit/23 [15:07:40] Even with the routing change [15:10:40] if there's a hit, the routing change is probably just a no-op [15:12:43] cdanis: it should route to a different gateway, that doesn't make sense T_T [15:17:28] yeah no it's not a no-op, I can confirm by hitting ATS directly on port 3128 on a different cp node that I get the Via: rest-gateway header [15:17:34] And not on the nodes I did the test change on [15:18:08] ok, merging the revert and unlocking puppet [15:20:33] ack, apologies [15:44:32] I don't understand how it even worked before we migrated because there's nothing I can see that would have avoided that issue with the pre-migration ATS or api-gateway configurations [15:45:30] I suspect this is actually broken on all calls made to api.w.o in client-side js that can return the same content from different wikis [15:47:23] 06Traffic, 07SEO: Bing can't search images from Commons, is Wikimedia denying their requests? - https://phabricator.wikimedia.org/T425850#11945598 (10AlexisJazz) >>! In T425850#11942471, @ssingh wrote: > Can someone confirm on how to reproduce this? 
If I try to go to Bing and do a reverse search with Commons,... [15:49:37] So either we stop caching these, or we split the cache on Origin, I actually am at a loss for any other solution [16:03:31] coming into this sideways: so the expectation was that existing cache entries (from the old api.wikimedia.org map entry) would not be hits on objects coming from the 3 per-URL-subpath replacement origins? [16:05:34] I mean, fundamentally I don't really get the patch, or what the 3x identical map entries are trying to accomplish [16:06:00] blblack: Not identical, 3 different paths, necessary so that we let through the actual wiki hosted on api.w.o [16:07:31] ok, that's fair [16:07:39] The expectation for the test was that changing what gateway we route through would show if the issue was with the rest-gateway config compared to the api-gateway config [16:07:50] I've now confirmed they have the exact same Origin reflection behaviour [16:08:35] Which means it was probably broken before, so I was trying to figure out if going from a single map without any plugin chain to what we do now could have changed caching behaviour in some way [16:09:20] But I'm not finding anything, which leads me to believe it's possibly always been somewhat broken, unless I've missed something as to how we cache stuff [16:09:22] what's "the issue" above? [16:09:28] https://phabricator.wikimedia.org/T426323 [16:10:42] Basically, client-side js on xx.wikipedia.org makes calls to api.wikimedia.org with Origin: xx.wikipedia.org. If you go to the same page on yy.wikipedia.org, it makes the same requests with Origin: yy.wikipedia.org, but we respond with a cached entry containing Access-Control-Allow-Origin: xx.wikipedia.org from the previous call [16:10:56] Let me finish commenting on task and I can point you to how to repro easily [16:12:14] yeah I can see it [16:12:37] But I think I may have stumbled upon something new. 
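The failure mode described at 16:10:42 can be reproduced without any real infrastructure. What follows is only a toy sketch, assuming a cache keyed on URL alone and a backend that echoes the request Origin into Access-Control-Allow-Origin; `reflecting_backend` and `UrlKeyedCache` are hypothetical names for illustration and are not how Varnish, ATS or the rest-gateway are implemented.

```python
from typing import Dict, Optional, Tuple

Headers = Dict[str, str]


def reflecting_backend(url: str, origin: Optional[str]) -> Headers:
    """Hypothetical stand-in for the gateway: echoes Origin back as ACAO."""
    headers: Headers = {"content-type": "application/json"}
    if origin is not None:
        headers["access-control-allow-origin"] = origin
    return headers


class UrlKeyedCache:
    """Toy cache keyed on URL only; response headers are stored with the object."""

    def __init__(self) -> None:
        self.store: Dict[str, Headers] = {}

    def fetch(self, url: str, origin: Optional[str]) -> Tuple[str, Headers]:
        if url in self.store:
            # The request Origin plays no part in the lookup.
            return "hit", self.store[url]
        response = reflecting_backend(url, origin)
        self.store[url] = response
        return "miss", response


URL = "/service/lw/recommendation/api/v1/translation/page-collection-groups"
cache = UrlKeyedCache()

first_status, first_headers = cache.fetch(URL, "https://xx.wikipedia.org")
second_status, second_headers = cache.fetch(URL, "https://yy.wikipedia.org")

print(first_status, first_headers["access-control-allow-origin"])
print(second_status, second_headers["access-control-allow-origin"])
```

The second request comes from yy.wikipedia.org but is served the ACAO stored for xx.wikipedia.org, which is the mismatch a browser then rejects.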
[16:12:48] curl -v https://api.wikimedia.org/service/lw/recommendation/api/v1/translation/page-collection-groups (without any other context) also gives: [16:13:02] < access-control-allow-origin: https://en.wikipedia.org< access-control-allow-origin: https://en.wikipedia.org [16:13:11] (oops double-paste) [16:13:29] and it's a cache hit [16:13:38] I'm on a cp-node and querying port 3128, so that's ATS right? [16:13:41] headers are part of what's cached [16:13:51] Doing that actually is a cache miss [16:14:27] who is setting the CORS? the caches or the applayer? [16:15:00] The gateway afaict [16:15:04] rest-gateway [16:15:08] so applayer for you :P [16:15:30] yeah [16:15:59] so, it's setting a CORS header on a cacheable object, and setting to whatever domain happened to fetch it when we cached it for everyone [16:16:16] the problem is broader than one user's navigation [16:16:17] cgoubert@cp6012:~$ curl -o /dev/null -s -v -H 'Origin: it.wikipedia.org' -H 'Host: api.wikimedia.org' 'http://api.wikimedia.org:3128/service/lw/recommendation/api/v1/translation/sections?source=es&target=en&count=6' 2>&1 | grep 'X-Cache'; curl -o /dev/null -s -v -H 'Origin: it.wikipedia.org' -H 'Host: api.wikimedia.org' [16:16:19] 'http://api.wikimedia.org:3128/service/lw/recommendation/api/v1/translation/sections?source=es&target=en&count=6' 2>&1 | grep 'X-Cache'; [16:16:21] < X-Cache-Int: cp6012 miss [16:16:23] < X-Cache-Int: cp6012 miss [16:16:52] ok [16:17:19] blblack: yeah, we have an allow origin of '*' in the gateway, but envoy treats that by reflecting the Origin in the returned acao [16:18:02] the frontend varnish is caching the acao [16:18:12] aaaaah [16:18:19] not looking at the right slice of the sandwich [16:19:17] bblack@memex:~/repos/puppet/modules/varnish$ curl -v -H 'Origin: it.wikipedia.org' https://api.wikimedia.org/service/lw/recommendation/api/v1/translation/page-collection-groups 2>&1 |grep -E 'x-cache-status|access-control-allow-origin' [16:19:22] < 
access-control-allow-origin: https://en.wikipedia.org [16:19:25] < x-cache-status: hit-front [16:19:54] so, this is creating cache entries for everyone, not just the scope of one user's navigation [16:21:36] in general, I don't think we can dynamically scope cacheable acao based on Origin, it doesn't work like that [16:21:55] Yeah I understand that [16:22:03] But if nothing's changed, was it even working before? [16:22:08] probably not [16:22:38] So we have two solutions, basically either not cache these APIs, or shard the cache by Origin, right? [16:22:45] the first-order bug here is api.wikimedia.org returning a cacheable page which contains a header which varies based on the origin [16:22:57] (a header that matters, anyways) [16:23:30] a second-order bug may exist in a higher-level sense, of designing a system in which we thought that would be something reliable [16:26:07] ELI5 - why are we enforcing CORS denial on this in the first place? [16:26:08] By the way, the Cache-Control: no-cache header def says "requires cache to revalidate", what does that revalidation entail? [16:26:34] blblack: idk I just ported the config over to a new gateway, predates me [16:28:03] claime: who's setting 'cache-control: no-cache' at what layer of this? [16:28:19] blblack: envoy is, rest-gateway layer [16:29:47] yeah, something's not being holistically handled well in this case with cache-control, either [16:30:05] because varnish is caching them, and we're not emitting any CC to the user [16:30:22] blblack: no-cache doesn't mean "don't cache" right? [16:30:33] It means revalidate before serving the cached content iiuc? 
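The "shard the cache by Origin" option raised at 16:22:38 corresponds to what generic HTTP caches do when a response carries Vary: Origin. This is a rough sketch under that assumption only: `origin_server` and `VaryAwareCache` are hypothetical, header names are assumed to be lowercase already, and nothing here reflects the actual VCL or ATS configuration.

```python
from typing import Dict, List, Tuple

Headers = Dict[str, str]
Variant = Tuple[Tuple[str, ...], Headers]


def origin_server(url: str, request: Headers) -> Headers:
    """Hypothetical backend: reflects Origin and declares that it varies on it."""
    return {
        "access-control-allow-origin": request.get("origin", ""),
        "vary": "origin",
    }


def secondary_key(vary: str, request: Headers) -> Tuple[str, ...]:
    """Build the per-variant key from the request headers named in Vary."""
    names = [name.strip() for name in vary.split(",")]
    return tuple(request.get(name, "") for name in names)


class VaryAwareCache:
    """Toy cache: one list of variants per URL, selected by the Vary headers."""

    def __init__(self) -> None:
        self.variants: Dict[str, List[Variant]] = {}

    def fetch(self, url: str, request: Headers) -> Tuple[str, Headers]:
        for key, response in self.variants.get(url, []):
            if secondary_key(response["vary"], request) == key:
                return "hit", response
        response = origin_server(url, request)
        key = secondary_key(response["vary"], request)
        self.variants.setdefault(url, []).append((key, response))
        return "miss", response


URL = "/service/lw/recommendation/api/v1/translation/page-collection-groups"
cache = VaryAwareCache()

status_xx, headers_xx = cache.fetch(URL, {"origin": "https://xx.wikipedia.org"})
status_yy, headers_yy = cache.fetch(URL, {"origin": "https://yy.wikipedia.org"})
status_xx_again, headers_xx_again = cache.fetch(URL, {"origin": "https://xx.wikipedia.org"})

print(status_xx, status_yy, status_xx_again, len(cache.variants[URL]))
```

Each Origin now gets a correct ACAO, at the cost of one stored variant per distinct Origin for the same URL, so the hit rate drops as the number of calling wikis grows.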
[16:31:06] yes, but still, we're not serving any CC to the user, either [16:31:16] revalidate means IMS or ETag, basically [16:32:57] there's also the whole spaghetti mess of "what the http pseudo-standards think cache-control means", and what each of ATS and Varnish precisely do with them internally, and what our Lua and VCL logic does in parsing them and possibly replacing them or forwarding them for the next layer out, etc [16:38:52] to be clear: no-cache basically means: check with the origin on every request, but send the origin request as conditional, including an If-Modified-Since or If-None-Match header, giving the backend a chance to respond with a quick 304 Not Modified and re-use the cached bytes we have as a "hit" [16:38:59] at that one layer [16:39:40] so it potentially avoids the transport bandwidth, but does not avoid the latency and dependency with the origin [16:40:03] yeah so, the backend service sets acao: * without reflecting Origin [16:40:11] (if I query it directly) [16:41:18] so I'm tempted to just remove the thing that does it in envoy [16:43:13] (although you could make a sort of -ffast-math style optimization argument that for anything for which we have matching cached bytes, we should just use them without checking anything and do the conditional request as a background refresh. given the eventual-consistency of the model of all our clients' view of the world (and really thus the Internet and the surrounding universe) cases which [16:43:19] would break under those conditions probably should've been no-store anyways). [16:43:28] yeah that's not my focus rn :D [16:43:32] sure [16:44:00] but thanks though, because I think I've found a possible way forward [16:44:06] but someone put the acao there for some reason in the past. chesterton's fence and all. [16:44:41] is there a gaping security hole in this that reflects client input or something? 
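The no-cache behaviour spelled out at 16:38:52 (always ask the origin, but conditionally, and reuse the cached bytes on a 304) can be sketched as below. This is a simplified model assuming ETag-based validation only; `Backend` and `RevalidatingCache` are hypothetical names, and the ETag is just a truncated hash of the body chosen for the example.

```python
import hashlib
from typing import Optional, Tuple


class Backend:
    """Hypothetical origin that supports If-None-Match style validation."""

    def __init__(self, body: str) -> None:
        self.body = body
        self.full_responses = 0  # how many times the full body was sent

    def etag(self) -> str:
        return hashlib.sha256(self.body.encode()).hexdigest()[:16]

    def get(self, if_none_match: Optional[str] = None) -> Tuple[int, str, Optional[str]]:
        tag = self.etag()
        if if_none_match == tag:
            return 304, tag, None  # Not Modified: no body on the wire
        self.full_responses += 1
        return 200, tag, self.body


class RevalidatingCache:
    """Toy cache that treats its single entry as no-cache: revalidate every time."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.entry: Optional[Tuple[str, str]] = None  # (etag, body)
        self.origin_round_trips = 0

    def fetch(self) -> Tuple[str, str]:
        # Every request still costs one round trip to the origin.
        self.origin_round_trips += 1
        cached_tag = self.entry[0] if self.entry else None
        status, tag, body = self.backend.get(cached_tag)
        if status == 304:
            assert self.entry is not None
            return "hit", self.entry[1]  # reuse the cached bytes
        assert body is not None
        self.entry = (tag, body)
        return "miss", body


backend = Backend("v1")
cache = RevalidatingCache(backend)

r1 = cache.fetch()   # nothing cached yet: full response
r2 = cache.fetch()   # unchanged: 304, served from cached bytes
backend.body = "v2"  # content changes at the origin
r3 = cache.fetch()   # validator no longer matches: full response

print(r1, r2, r3, cache.origin_round_trips, backend.full_responses)
```

Three requests produce three origin round trips but only two full bodies, which matches the point that revalidation saves bandwidth and not the latency or the dependency on the origin.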
[16:45:12] Gmmm wait [16:46:42] Nah, I thought maybe these routes didn't have it in the old gateway, but they did, and anyway I have the same behaviour with the old gw [16:49:56] ok I missed something but that isn't what's causing the issue. [17:25:06] 06Traffic, 10ContentTranslation, 06LPL Hypothesis, 06Security-Team, and 5 others: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323#11945825 (10Clement_Goubert) Easy way to repro in browser: - Open dev tools - Go to https://fr.wikipedia.org... [17:28:55] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp6015:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown [17:31:11] 06Traffic: Create an alert for depooled cp hosts - https://phabricator.wikimedia.org/T406641#11945865 (10CDobbins) (proposed) runbook is here: https://wikitech.wikimedia.org/wiki/Traffic/Runbooks/CPHostDepooled I know it's sparse, but I had a hard time thinking of and trying to find ways that someone outside of... [17:50:47] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945930 (10RobH) [17:52:11] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945946 (10RobH) [17:57:20] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11945980 (10RobH) Without getting into pricing on this public task the options are: * spend more money (see T426985) to replace the CPU ** we have no money left in expendables for this, so it would... 
[17:58:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [18:00:21] uh? [18:03:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [18:23:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4046:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4046&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [18:23:54] uh [18:26:14] interesting [18:33:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4046:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4046&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [18:33:59] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp4046:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [18:34:00] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11946092 (10ssingh) >>! In T414411#11945980, @RobH wrote: > Without getting into pricing on this public task the options are: > > * spend more money (see T426985) to replace the CPU > ** we have no... 
[18:38:44] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp4046:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:01:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4050:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4050&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:01:40] well [19:06:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4050:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4050&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:10:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:11:44] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:15:29] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:25:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4050:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4050&viewPanel=10 - 
https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:26:44] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:30:29] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:31:44] RESOLVED: [2x] HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:45:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [19:50:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [20:14:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [20:19:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - 
https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [20:43:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [20:48:30] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp4047:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4047&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [21:15:12] 06Traffic, 06SRE, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11946494 (10VRiley-WMF)