[07:11:31] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s: growthexperiments-deleteoldsurveys OOMKilled during June 1st run - https://phabricator.wikimedia.org/T395893#10882676 (JMeybohm)
[08:30:33] serviceops, Deployments, Patch-For-Review, Release-Engineering-Team (Radar), Wikimedia-production-error: httpb sometimes fails upon deployment with a HTTP 503 - https://phabricator.wikimedia.org/T380958#10882837 (akosiaris) This is surfacing once every couple of days or so, at least per [Logs...
[08:55:04] is someone doing something on the kubestage cluster? I got this error when deploying the new thumbor: https://paste.debian.net/hidden/b8414ad7/
[08:55:53] moritzm: I deployed changeprop-jobqueue successfully a few minutes ago
[08:55:58] hiccup?
[08:56:21] there's a race condition when the API server is restarted after a new cert IIRC
[08:57:32] ok, I just retried and now it worked fine. will keep that in mind
[08:57:36] Active: active (running) since Wed 2025-06-04 08:56:38 UTC; 48s ago
[08:57:45] sure enough, it got restarted a few seconds ago?
[08:58:07] Active: active (running) since Wed 2025-06-04 08:55:36 UTC; 2min 22s ago
[08:58:16] first one was for 1003, this one is for 1004
[08:58:28] second one lines up timewise I think
[08:58:42] ack, seems so
[08:58:55] and ofc kubestage1005 ... Active: active (running) since Wed 2025-06-04 08:54:33 UTC; 4min 6s ago
[08:59:04] this is ridiculous, they are all happening around the same time
[09:19:05] serviceops, ChangeProp, cloud-services-team, collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10882980 (hashar) Open→Resolved a:hashar After chatting with Alexandros, the relic...
[11:11:48] serviceops, LPL Essential: MediaWiki periodic job purge-old-cx-drafts failed - https://phabricator.wikimedia.org/T395892#10883309 (hnowlan) After amending the dblist, a manual run as pod `purge-old-cx-drafts-202506041039-8942q` has completed successfully.
[13:10:44] serviceops, function-orchestrator, Abstract Wikipedia team (25Q4 (Apr–Jun)), OKR-Work: Enable memcached in the orchestrator - https://phabricator.wikimedia.org/T391986#10883773 (cmassaro) Thank you so much! It looks like we can take this for a test run? We'll try it out today.
[14:14:48] serviceops, Abstract Wikipedia team, function-evaluator, function-orchestrator, and 2 others: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf - https://phabricator.wikimedia.org/T396033 (Jdforrester-WMF) NEW
[14:27:33] serviceops, Abstract Wikipedia team, function-evaluator, function-orchestrator, and 2 others: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf - https://phabricator.wikimedia.org/T396033#10884216 (Joe) p:Triage→Unbreak! a:...
[14:39:20] serviceops, Abstract Wikipedia team, function-evaluator, function-orchestrator, and 3 others: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf - https://phabricator.wikimedia.org/T396033#10884284 (Joe) Open→Resolved The...
[15:08:09] serviceops, Abstract Wikipedia team, function-evaluator, function-orchestrator, and 3 others: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf - https://phabricator.wikimedia.org/T396033#10884429 (Jdforrester-WMF) >>! In T39603...
[15:48:51] akosiaris: that's expected/required depending on which certificate got refreshed.
[15:49:26] when the cert for signing service accounts gets refreshed, all other apiservers need to know about the new one. So they all need to restart sooner rather than later
[15:56:41] jayme: ok, but maybe with a bit more leeway between the restarts? Like a few more minutes?
[16:00:06] akosiaris: yeah...it's a tight race I suppose. A token that gets signed after the certificate rotation is considered invalid on the other apiservers as long as they have not restarted
[16:00:47] serviceops, SRE: Silence RESTGatewayBackendErrorsHigh for envoy_cluster_name: mobileapps_cluster - https://phabricator.wikimedia.org/T394609#10884661 (hnowlan) Open→Resolved a:hnowlan
[16:01:02] but for that a cert refresh has to come together with a token being signed - and the apiserver that has the refreshed cert needs to be the signing authority
[16:01:36] so it's probably rare...but also a very sneaky thing to debug when it bites us
[16:21:34] serviceops, Shellbox, Wikibase-Quality-Constraints, Wikidata, and 2 others: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error - https://phabricator.wikimedia.org/T362084#10884811 (JMeybohm) Open→Resolved We don't see that constant rate of errors anymore, so I guess th...
[16:35:54] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10884949 (Jdforrester-WMF)
[16:39:28] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10884963 (Jdforrester-WMF)
[16:39:51] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s: growthexperiments-deleteoldsurveys OOMKilled during June 1st run - https://phabricator.wikimedia.org/T395893#10884965 (hnowlan) We have temporarily removed limits and this job has run successfully, however even when running with a relatively...
[16:40:15] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s: growthexperiments-deleteoldsurveys OOMKilled during June 1st run - https://phabricator.wikimedia.org/T395893#10884968 (hnowlan) p:Triage→High
[17:01:26] serviceops, LPL Essential: MediaWiki periodic job purge-old-cx-drafts failed - https://phabricator.wikimedia.org/T395892#10885047 (hnowlan) Open→Resolved a:hnowlan
[17:54:58] serviceops, Security-Team, MW-1.45-notes (1.45.0-wmf.3; 2025-05-27), Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10885317 (Clement_Goubert) ` extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php: Start run Curr...
[18:28:07] serviceops, Security-Team, MW-1.45-notes (1.45.0-wmf.3; 2025-05-27), Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10885425 (Reedy) It is possible it has clashes (because the word list isn’t that big)… And then doesn’t...
[20:39:28] serviceops, Deployments, Release-Engineering-Team (Radar), Wikimedia-production-error: httpb sometimes fails upon deployment with a HTTP 503 - https://phabricator.wikimedia.org/T380958#10885815 (cjming) encountered this deploying today - https://spiderpig.wikimedia.org/jobs/154 {F61547159} not...
[21:04:45] serviceops, Abstract Wikipedia team, function-orchestrator, Essential-Work: Unable to deploy wikifunctions services in production: Helm timeout for prod push of memcached access - https://phabricator.wikimedia.org/T396074 (Jdforrester-WMF) NEW
[21:05:49] serviceops, Abstract Wikipedia team, function-orchestrator, Essential-Work: Unable to deploy wikifunctions services in production: Helm timeout for prod push of memcached access - https://phabricator.wikimedia.org/T396074#10885896 (Jdforrester-WMF) (Distinct from T396033, which was fixed earlier...