[08:47:42] Raine: i'm excited! I hope it will be very boring though :P
[09:36:35] Raine: i'm all set... we could start early if you are around.
[10:11:15] effie, jayme: are you around?
[10:11:57] Raine doesn't seem to be around, and I need an emergency backup for the scheduled api rate limit deployment...
[10:12:47] I am
[10:13:15] effie: cool, thanks!
[10:13:26] ping me if you need anything
[10:13:31] and break a leg
[10:13:37] I'm about to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1277709
[10:13:51] I have read the backlog :)
[10:14:08] excellent. i'll go ahead and merge, then
[10:17:18] duesen: I will be looking at https://grafana-rw.wikimedia.org/d/f96f110d-b395-455a-9089-734579ef9b01/mw-square-one-mw-api-ext and https://grafana-rw.wikimedia.org/d/c606e050-b2dd-4cc7-aa5c-6c8979a15f82/cdn-text-square-one from time to time, I reckon we will spot any potential problems
[10:18:02] ok thanks
[10:20:20] running tests on staging
[10:22:55] thanks effie for stepping in
[10:24:02] tests pass, I'll apply to codfw
[10:27:04] grand
[10:27:13] Uh, I'm sorry, I slept through my alarms
[10:31:24] Raine: hey, welcome to the exciting world of rate limit deployments :)
[10:31:58] codfw looking good. Moving on to eqiad.
[10:31:59] duesen: we are serving errors, I am afraid
[10:32:26] effie: 429 responses are expected to go up. anything else?
[10:32:32] they are 5xx
[10:32:59] check the second dashboard I sent you, the cdn one
[10:33:15] I'm here as well
[10:33:38] ok there was a peak and now it goes down
[10:34:05] you mean the spike in "HAproxy 5xx errors"?
[10:34:21] yes, shall we give it 1-2 more minutes and then continue?
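[Editor's note: the change above deploys new API rate limits, and the deployers expect 429 responses to rise. As background, here is a minimal illustrative sketch of a token-bucket rate limiter that returns 429 when a client exceeds its quota. This is an assumption-laden toy, not the actual MediaWiki or rest-gateway implementation; the class name, capacity, and refill rate are all hypothetical.]

```python
import time
from typing import Optional


class TokenBucket:
    """Toy token-bucket rate limiter (illustrative only, not the
    actual rest-gateway code): each client gets `capacity` tokens,
    refilling at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.state = {}  # client -> (tokens_left, last_seen_timestamp)

    def allow(self, client: str, now: Optional[float] = None) -> int:
        """Return an HTTP status code: 200 if allowed, 429 if throttled."""
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(client, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.state[client] = (tokens, now)
            return 429
        self.state[client] = (tokens - 1, now)
        return 200
```

With a capacity of 2 and a refill rate of 1 token/s, a burst of three immediate requests yields 200, 200, 429; after a few seconds of quiet the client is allowed again.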
[10:34:31] it looks like it has normalised
[10:34:32] sure
[10:34:34] in theory mw-api-ext errors wouldn't be something that rest-gateway itself could spike
[10:34:38] I'm looking at https://grafana.wikimedia.org/goto/efkg0dboowm4gc?orgId=1
[10:34:52] that doesn't show any 500s coming from the gateway
[10:34:54] unless it was caused by dropped/failed requests due to the rollout or something
[10:35:23] s/any/any more than usual/
[10:35:31] it may also be random, I mean, who knows these days
[10:35:48] yea, thanks for keeping an eye out!
[10:37:24] the "ATS <-> rest-gateway 5xx responses" graph on the gateway dashboard looks fine. And the HAProxy graph looks ok to me as well, though the error rate still seems slightly higher than before the spike.
[10:38:50] effie: can i move forward?
[10:39:11] yeah go ahead
[10:39:15] k
[10:39:26] I will keep looking into where those errors came from
[10:43:04] it was random, it seems like someone was hitting lists.w.o
[10:43:39] the errors:
[10:43:39] https://logstash.wikimedia.org/app/dashboards#/view/Varnish-Webrequest-50X?_g=(filters:!(),refreshInterval:(pause:!f,value:60000),time:(from:'2026-04-28T10:24:31.903Z',to:'2026-04-28T10:27:24.006Z'))&_a=(description:'HTTP%20requests%20resulting%20in%20errors,%20as%20seen%20by%20Varnish',filters:!(),fullScreenMode:!f,options:(darkTheme:!f),query:(language:lucene,query:'type:webrequest%20AND%20NOT%20(user_agent:%20*bot*%20OR%20user_agent:%20*crawl*%20OR%20user_agent:%20*spider*%20OR%20user_agent:%20*search*)'),timeRestore:!t,title:'Varnish%20Webrequest%2050X',viewMode:view)
[10:43:43] ...
[10:43:46] sorry for the spam
[10:44:14] lists.w.o varnish errors, unrelated: https://logstash.wikimedia.org/goto/6b7ec6aaeaa6c06ebc3c40995e899a0c
[10:45:26] thank you for checking!
[10:45:32] deployment to eqiad looks good.
[10:46:15] I will monitor for the rest of the day. I don't expect operational issues, but we may need to update the rules to address unforeseen consequences.
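[Editor's note: the exchange above is a manual version of a standard check: compute the 5xx fraction of traffic and compare it against a baseline to decide whether a spike is real. A minimal sketch of that logic follows; the 3x-baseline threshold is an assumption for illustration, not the team's actual alerting rule.]

```python
def error_ratio(status_counts: dict) -> float:
    """Fraction of requests with a 5xx status, as plotted on
    'HAProxy 5xx errors'-style graphs. Keys are status strings
    like '200', '429', '503'; values are request counts."""
    total = sum(status_counts.values())
    errors = sum(n for s, n in status_counts.items() if s.startswith("5"))
    return errors / total if total else 0.0


def is_spike(current: float, baseline: float, factor: float = 3.0) -> bool:
    """Flag a spike when the current 5xx ratio exceeds `factor`
    times the pre-deployment baseline (threshold is illustrative)."""
    return current > baseline * factor
```

For example, a window of 990 OKs, 5 throttled requests, and 5 server errors has a 5xx ratio of 0.005; if the baseline was also around 0.005, that would not count as a spike under this rule.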
[10:46:32] duesen: on the cdn-text-square-one dashboard, look at the second row, labelled Varnish
[10:46:41] "Varnish Requests by http status (exl PURGE)"
[10:46:53] you can click on the 4xx series and see the 429 trend
[10:47:09] https://usercontent.irccloud-cdn.com/file/7hMy5NT3/image.png
[10:48:00] nice, thanks Daniel and Effie
[10:48:03] cheers
[10:48:35] Thank you effie for stepping in <3
[10:48:51] np
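[Editor's note: the dashboard panel described above groups requests by status class (2xx, 4xx, 5xx), excludes PURGE requests, and lets you drill into one class to see a single status such as 429. A small sketch of that aggregation, purely illustrative of what the panel computes; function names and the input shape are hypothetical.]

```python
from collections import Counter


def status_classes(requests: list) -> Counter:
    """Count requests per status class (e.g. '2xx', '4xx'),
    excluding PURGE, mirroring a 'Requests by http status
    (exl PURGE)'-style panel. Input: (method, status) tuples."""
    return Counter(f"{status // 100}xx"
                   for method, status in requests if method != "PURGE")


def drill_down(requests: list, klass: str) -> Counter:
    """Per-status counts within one class, e.g. the 429
    trend inside the 4xx bucket."""
    return Counter(status for method, status in requests
                   if method != "PURGE" and f"{status // 100}xx" == klass)
```

Drilling into the 4xx bucket then separates expected throttling (429) from other client errors such as 404.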