[00:10:14] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:22:04] (03PS24) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [00:22:24] (03CR) 10Fabfur: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [00:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626939 (10phaultfinder) [00:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 [00:38:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot) [00:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626951 (10phaultfinder) [00:50:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot) [00:54:58] (03CR) 10Ssingh: "Looks good, mostly questions/nits and no hard blockers IMO." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [01:08:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 [01:08:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot) [01:27:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot) [01:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626974 (10phaultfinder) [02:11:10] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627060 (10phaultfinder) [04:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627118 (10phaultfinder) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:11] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126707 [05:14:28] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126708 [05:28:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:29:18] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:08] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0600) [06:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627163 (10phaultfinder) [06:21:10] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:36:10] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:40:52] (03PS10) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [06:49:00] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627185 (10phaultfinder) [07:08:00] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:19:25] (03PS11) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:20:34] (03CR) 10CI reject: [V:04-1] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [07:26:30] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1037.eqiad.wmnet [07:33:38] (03PS1) 10Muehlenhoff: Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) [07:38:32] (03CR) 10Muehlenhoff: [C:03+2] Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) (owner: 10Muehlenhoff) [07:41:15] (03PS12) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:44:56] (03PS13) 
10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:45:57] (03CR) 10Filippo Giunchedi: [C:03+2] sqlite: require sqlite::package in 'file' db resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi) [07:50:51] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10627205 (10MoritzMuehlenhoff) 05Open→03Resolved @AStein-WMF You should now be able to log into... [07:51:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [07:55:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10627212 (10Marostegui) 05Open→03Resolved Everything looks good, thank you! 
[07:56:25] (03CR) 10Slyngshede: [C:03+2] Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede) [07:57:56] (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [07:58:19] (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [07:58:54] (03CR) 10Elukey: [C:03+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [08:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800). [08:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800) [08:00:44] o/ [08:01:38] <_joe_> hashar: do we need to run the train now?
[08:01:51] <_joe_> it's strange to have such a superposition [08:02:01] (03PS1) 10Slyngshede: Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 [08:02:26] <_joe_> hashar: asking because otherwise I'll merge my changes [08:02:51] I have a ton of mediawiki config changes to push [08:02:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1176,1217,1228].eqiad.wmnet with reason: m5 master switch T388500 [08:02:55] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500 [08:03:04] the train windows overlap because of daylight saving time confusion [08:03:15] (03Merged) 10jenkins-bot: Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede) [08:03:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [08:03:23] it's tied to the Pacific time zone when really it should be tied to Europe :) [08:03:31] jouncebot: refresh [08:03:32] I refreshed my knowledge about deployments. [08:03:35] jouncebot: nowandnext [08:03:35] For the next 0 hour(s) and 56 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800) [08:03:35] In 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [08:03:38] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete custom partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:03:47] <_joe_> hashar: I suspected something like that [08:03:48] <_joe_> :D [08:03:50] <_joe_> thanks [08:03:54] <_joe_> can I proceed then? [08:04:08] for what?
[08:05:07] I am deploying the patches from https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results [08:06:45] (03PS1) 10Marostegui: mariadb: Promote db1228 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) [08:06:54] (03PS1) 10Muehlenhoff: Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) [08:08:49] (03PS1) 10Filippo Giunchedi: pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912 [08:08:53] (03PS1) 10Filippo Giunchedi: pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913 [08:09:20] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ \o/ \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:09:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar) [08:09:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [08:09:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [08:09:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [08:09:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [08:09:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [08:09:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy) [08:10:15] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912 (owner: 10Filippo Giunchedi) [08:10:23] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913 (owner: 10Filippo Giunchedi) [08:10:24] <_joe_> hashar: uhm wait [08:10:31] (03Merged) 10jenkins-bot: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar) [08:10:33] (03Merged) 10jenkins-bot: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [08:10:37] (03Merged) 10jenkins-bot: Remove obsolete CirrusSearch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [08:10:39] (03Merged) 10jenkins-bot: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [08:10:40] <_joe_> so you're backporting patches that weren't in the schedule before? 
[08:10:41] (03Merged) 10jenkins-bot: Remove Cognate legacy settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [08:11:04] <_joe_> I'd have liked to discuss it [08:11:19] (03Merged) 10jenkins-bot: Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [08:11:20] (03Merged) 10jenkins-bot: InitialiseSettings.php: Remove unused NavigationTiming config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy) [08:11:34] they are all noop cleanup patches, we pushed some of those out of window on thursday [08:11:54] I have considered pushing them on Friday but moved that to Monday instead and forgot I had an appointment [08:12:19] <_joe_> hashar: that's not the point, I had a deployment scheduled, I was verifying a few details about one of the patches before proceeding, you just moved in front of me. It's not really cool, but ok, I'll wait [08:12:20] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMain [08:12:20] tenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] [08:12:21] I went lazy and did not schedule them yesterday since the tuesday morning window was empty yesterday and it is often empty [08:12:26] T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407 [08:12:26] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:12:43] <_joe_> hashar: ping me when you're done 
[08:13:35] ah I see [08:13:38] 06SRE, 06Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628 (10elukey) 03NEW [08:13:55] I guess next time I will schedule those so you are not caught off guard last minute [08:14:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10627285 (10elukey) Opened T388628 to verify if we can use/import storcli in our apt repo. [08:16:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede) [08:16:10] (03PS3) 10Filippo Giunchedi: pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044 [08:16:17] <_joe_> hashar: it's about waiting in queue appropriately, you know, civil coexistence and mutual respect. "sorry" was the appropriate response here. In any case, let's move past this before I get even more upset :) [08:16:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [08:16:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1037.eqiad.wmnet [08:16:24] !log hashar@deploy2002 reedy, hashar: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMod [08:16:24] e]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:16:56] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044
(owner: 10Filippo Giunchedi) [08:17:22] (03CR) 10Slyngshede: [C:03+2] Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede) [08:19:14] !log hashar@deploy2002 reedy, hashar: Continuing with sync [08:21:17] (03CR) 10Federico Ceratto: [C:03+1] "LGTM Added already-resolved comments. I grepped for db1176 and its ipaddr across other dbproxy* files without finding it." [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui) [08:22:04] (03CR) 10Marostegui: [C:03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui) [08:24:21] !log Failover m5 from db1176 to db1228 - T388500 [08:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:26] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500 [08:25:09] (03PS2) 10Hashar: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 [08:25:20] (03CR) 10Cyndywikime: [C:03+1] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [08:25:26] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMai [08:25:26] ntenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] (duration: 13m 06s) [08:25:30] T207407: Remove legacy Database search integration of 
ArticlePlaceholder - https://phabricator.wikimedia.org/T207407 [08:25:30] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:25:47] (03PS4) 10Filippo Giunchedi: pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045 [08:26:15] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045 (owner: 10Filippo Giunchedi) [08:26:30] (03CR) 10Hashar: "My patch went to conflict with I775d9ec67f662ff3f30c097dd828833af86a29fe by @reedy@wikimedia.org . It also removed a duplicate `wfLoadExte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [08:26:42] checking logs after the full depoy [08:26:43] deploy [08:27:20] (03PS1) 10Marostegui: db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918 [08:27:59] (03CR) 10Marostegui: [C:03+2] db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918 (owner: 10Marostegui) [08:28:06] _joe_: it looks all good. And sorry next time I will add them all to the schedule instead of assuming that nobody else would use the window [08:28:15] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1176.eqiad.wmnet [08:28:28] <_joe_> hashar: I was even pinged here... 
[08:28:31] <_joe_> anyways, ok [08:28:44] <_joe_> proceeding [08:29:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [08:30:15] (03Merged) 10jenkins-bot: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [08:30:46] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] [08:31:16] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629 (10fgiunchedi) 03NEW [08:32:23] (03PS2) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 [08:32:26] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi) [08:32:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1176.eqiad.wmnet [08:33:33] (03Abandoned) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi) [08:33:52] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:33:54] !log oblivian@deploy2002 oblivian: Continuing with sync [08:34:22] (03PS2) 10Filippo Giunchedi: pontoon: improve error messages and new-stack cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1126914 [08:34:45] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve error messages and new-stack cleanup 
[puppet] - 10https://gerrit.wikimedia.org/r/1126914 (owner: 10Filippo Giunchedi) [08:37:08] (03PS4) 10Filippo Giunchedi: pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046 [08:37:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1037.eqiad.wmnet with OS bookworm [08:38:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm [08:39:28] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046 (owner: 10Filippo Giunchedi) [08:40:20] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] (duration: 09m 34s) [08:41:00] <_joe_> proceeding with the second patch. it will have some small changes happen to things we're running [08:41:03] (03PS1) 10Brouberol: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) [08:41:09] (03PS1) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [08:41:12] (03PS1) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [08:41:17] (03PS8) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) [08:41:19] (03PS1) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [08:42:41] (03PS1) 10Slyngshede: IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924 [08:43:43] (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [08:45:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: Maintenance [08:45:50] (03CR) 10ArielGlenn: [C:03+1] "Thanks for this, looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [08:45:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1176.eqiad.wmnet with reason: Maintenance [08:46:17] (03Merged) 10jenkins-bot: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [08:47:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede) [08:48:09] (03CR) 10Slyngshede: [C:03+2] IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede) [08:48:27] !log slyngshede@dns1004 START - running authdns-update [08:50:34] !log slyngshede@dns1004 END - running authdns-update [08:52:45] !log oblivian@deploy2002 Started scap sync-world: Updating k8s chart [08:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:55:05] !log oblivian@deploy2002 
Finished scap sync-world: Updating k8s chart (duration: 03m 42s) [08:56:50] <_joe_> uh what's going on with mw-jobrunner? [08:57:54] checking [08:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:58:19] (03PS1) 10Marostegui: mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) [08:58:23] (03CR) 10Volans: [C:03+1] "LGTM! Thanks for the addition! I've left some questions and a couple of non-blocking nits. I'll leave to traffic the final approval." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [08:58:29] (03CR) 10Vgutierrez: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [08:58:51] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) (owner: 10Marostegui) [08:59:29] saturation since 8:38 [08:59:49] 2 slowdowns before that, normally due to deploys [09:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [09:00:21] ^ train will be run tonight [09:03:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage [09:04:23] _joe_: something points to something having happened at 8:39, but I believe your deploy was after that?
[09:06:01] latency increased at 8:21 [09:06:18] https://grafana.wikimedia.org/goto/3rEC1ShHR?orgId=1 [09:06:52] my guess would be at hashar's deployment [09:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage [09:07:14] <_joe_> jynus: yes, it's "organic" [09:07:21] <_joe_> and tbh ok if jobrunners are running hot [09:07:31] <_joe_> as long as it's just "hot" and not "failing" [09:07:34] hmm [09:07:44] just fyi, hashar [09:07:59] all the patches I have pushed are removing unused mediawiki configs and all have been reviewed as doing just that afaik [09:08:15] there seems to be extra load since 8:20 [09:08:28] (03CR) 10Ayounsi: [C:03+2] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [09:08:29] but I am not ruling out it might have caused some cascading effect somewhere! [09:08:30] <_joe_> which would square up with hashar's deployment [09:08:52] <_joe_> take a look at jobs frequency, I can't spend time on this right now sorry [09:08:59] let me try to find out what the extra work is being spent on [09:10:05] (03Merged) 10jenkins-bot: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [09:10:05] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [09:10:48] there is extra parsoidCacheprewarm, but that doesn't line up with the 8:20 timestamp [09:10:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1125.eqiad.wmnet [09:11:31] the spikes that line up are refreshlinks [09:11:45] but they are not ongoing [09:11:47] (03PS1) 10Marostegui: mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) [09:12:27] (03CR) 10Marostegui: [C:03+2] 
mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) (owner: 10Marostegui) [09:13:48] (03CR) 10Vgutierrez: varnish: add log filters to slowquery logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [09:13:57] (03CR) 10Muehlenhoff: [C:03+2] idm: Add approval rule for airflow-search-ops in production [puppet] - 10https://gerrit.wikimedia.org/r/1123665 (owner: 10Muehlenhoff) [09:14:16] (03CR) 10Vgutierrez: [C:03+1] "looking good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [09:16:25] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:17:28] (03CR) 10Brouberol: [C:03+2] airflow: fix datahub connection host values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126655 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:17:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:18:53] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:19:40] (03PS1) 10Filippo Giunchedi: prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) [09:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1037.eqiad.wmnet with OS bookworm [09:27:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to 
Bookworm - https://phabricator.wikimedia.org/T382507#10627627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm completed: - ganeti103... [09:32:37] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:33:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1125.eqiad.wmnet [09:33:28] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627649 (10Marostegui) a:05Marostegui→03None [09:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:42:54] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:44:23] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627683 (10Marostegui) Ready for 
#dc-ops [09:44:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [09:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:45:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [09:48:45] sadly I believe the alert will return after deployment is done [09:50:13] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:53:37] !log fio testing on ms-be2088 T384003 [09:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [09:55:28] (03CR) 10Btullis: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:56:17] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:03] (03CR) 10Btullis: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:57:58] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:58:12] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:00:15] what [10:00:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:00:45] yeah so that is bugged for sure :) [10:00:51] (03CR) 10Marostegui: "We should also remove the master role from its yaml. It can be done here or in a separate patch" [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:00:58] timezones are hard [10:01:15] jouncebot: refresh [10:01:15] I refreshed my knowledge about deployments. 
[10:01:18] jouncebot: now [10:01:18] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:01:19] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:01:38] (03PS2) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:01:59] I will tie it to UTC [10:02:16] (03PS1) 10Slyngshede: P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) [10:02:34] (03CR) 10Brouberol: [C:03+1] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [10:02:57] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:21] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5058/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 
10Slyngshede) [10:03:22] jouncebot: refresh [10:03:23] I refreshed my knowledge about deployments. [10:03:27] jouncebot: now [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:04:03] oh because the train window is two hours long! [10:04:42] (03PS8) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:05:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:05:43] hashar: I have a window now, ok to proceed ? [10:05:56] yeah there is no train this morning [10:05:58] it will run tonight [10:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:07:16] (03PS9) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:07:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [10:07:46] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:07:47] (03PS8) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 
(https://phabricator.wikimedia.org/T383845) [10:08:05] (03CR) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:08:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:08:26] (03PS4) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [10:10:09] (03PS3) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:13:40] !log installing systemd bugfix updates from Bookworm point release [10:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] (03CR) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:14:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:14:28] !log removing backup1002, backup2002 dump user on es6,es7 T387892 [10:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:14:31] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:15:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:01] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [10:18:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:19:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:24:27] (03PS1) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) [10:25:44] (03PS1) 10Hashar: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) [10:25:44] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:26:26] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:27:08] (03CR) 10Hashar: "This is part of removing obsolete settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:27:55] (03PS1) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 [10:28:28] (03CR) 10David Caro: [V:03+1 C:03+2] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:30:25] (03PS2) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) [10:31:33] (03CR) 10Filippo Giunchedi: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (owner: 10Ayounsi) [10:31:53] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:22] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:29] (03PS10) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:34:39] (03CR) 10David Caro: "Tested in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:35:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:36:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:18] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:42] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:37:59] (03PS2) 10David Caro: clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) [10:38:05] (03PS11) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:38:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:39:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627787 (10MoritzMuehlenhoff) [10:39:15] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5059/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:41:09] (03CR) 10Clément Goubert: [C:03+2] mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [10:41:15] (03CR) 10David Caro: [V:03+1 C:03+2] clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 
(https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:42:21] !log removing backup1002, backup2002 dbbackups user @ m1 T387892 [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:43:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:18] (03CR) 10Kamila Součková: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:44:20] (03CR) 10Elukey: "Aaron: I double checked the staging cpu/memory saturation graphs and around the time of your deploy I see a bump:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:24] !log jiji@deploy2002 Started scap sync-world: (T383845) mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 [10:44:26] (03CR) 10Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:27] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [10:44:31] (03CR) 10Elukey: services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:47:04] (03PS2) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) [10:47:43] (03PS12) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 
10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:48:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:48:26] job runner seems happy again [10:50:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:51:18] lets wait a little bit [10:51:49] (03CR) 10Jcrespo: [C:04-1] "No worries. Now that I understood the assigment, I will rethink this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:51:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:07] !incidents [10:52:07] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [10:52:07] 5726 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [10:52:11] acked [10:52:13] mw-api-int rps are way down [10:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 20.69s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:52:28] checking [10:52:34] volans: I am deploying [10:52:35] was there a deploy ongoing? [10:52:37] effie is moving it to php 8.1 [10:52:48] (03PS12) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:52:50] should we revert or continue?
[10:52:57] effie: ^ [10:53:05] (03CR) 10D3r1ck01: [C:03+1] Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:53:06] I am mid scap [10:53:09] scap is not done [10:53:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:53:14] [for later] the link to the runbook of the page has no content [10:53:21] api seems down [10:53:28] job insertion rate is way down also [10:53:28] scap is going to roll back most likely [10:53:41] did it work on canary? [10:53:47] ok, then let's give it a minute [10:53:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:54:12] (03CR) 10Cathal Mooney: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:54:14] latency http errors skyrocketed [10:54:21] https://grafana.wikimedia.org/d/aSiSoKoSk/mw-parsoid?orgId=1 looks pretty bad [10:54:21] volans: I see database errors on mw [10:54:32] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server [10:54:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:40] let's go to -sre [10:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627861 (10phaultfinder) [10:54:43] parsoid serving a lot of 500s [10:55:12] es overload [10:55:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is
elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:55:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:55:24] this is parsoid going crazy overloading content dbs [10:55:25] (03CR) 10Ayounsi: [C:03+2] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:55:25] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126949 [10:55:32] please lets move the conversation to -sre, [10:55:42] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) [10:55:49] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126951 [10:56:01] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) [10:56:37] (03Merged) 10jenkins-bot: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:56:39] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS 
to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (10cmooney) [10:56:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:16] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 37.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:38] !log jiji@deploy2002 scap failed: 'production' (scap version: 4.140.0) (duration: 13m 54s) [10:58:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:00:01] (03CR) 10Btullis: "Removing the +1 because we are discussing another way to achieve this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100). 
[11:00:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:02:16] RESOLVED: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:03:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:04:11] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:04:18] (03CR) 10JMeybohm: [C:03+2] global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [11:05:30] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:05:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:05:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:07:20] (03PS9) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) [11:07:46] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:08:43] (03CR) 10JMeybohm: k8s::client: Allow for 
install of all kubectl versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [11:08:46] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [11:09:11] (03PS1) 10Stevemunene: hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) [11:09:15] jouncebot: now [11:09:15] For the next 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100) [11:09:36] (03PS13) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [11:09:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:11:18] (03PS1) 10Superpes15: [enwiki] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) [11:11:26] !log fio testing on ms-be2088 while resetting controller T384003 [11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [11:11:39] (03PS1) 10Stevemunene: hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512) [11:11:41] (03PS1) 
Stevemunene: hdfs: Assign the right role to new hdfs workers 1[187-208] [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) [11:12:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:13] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:13:42] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:14:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:15:56] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:16:15] (PS1) Vgutierrez: cumin: Add liberica aliases per DC [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) [11:16:26] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:16:46] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:17:14] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:17:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:18:27] !log reimage lvs6003 as a liberica instance - T384477 [11:18:29] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:30] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [11:18:32] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1125430 (owner: PipelineBot) [11:19:02] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6003 as liberica [puppet] - https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [11:20:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:21:35] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bookworm [11:22:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:42] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:25:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:25:48] ^^ BGP alert is lvs6003 being reimaged [11:27:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes
(http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:29:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:30:32] task https://phabricator.wikimedia.org/T388646 has been filed for the DBUnexpectedError spike [11:30:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:31:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal@codfw [11:31:45] (CR) Vgutierrez: [C:+2] hiera,wdqs: Enable IPIP on wdqs-internal@codfw [puppet] - https://gerrit.wikimedia.org/r/1123678 (https://phabricator.wikimedia.org/T387318) (owner: Vgutierrez) [11:31:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:04] !incidents [11:32:05] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [11:32:05] 5727 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad)
[11:32:05] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [11:32:08] !ack 5727 [11:32:09] 5727 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [11:32:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.317s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:34:37] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:35:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:36:51] (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) [11:36:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:26] RESOLVED: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:36] (PS1)
JMeybohm: deployment_server: Remove special handling of ci user [puppet] - https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) [11:37:38] (PS1) JMeybohm: helmfile: Dump data about each service (users, namespace etc.) to yaml [puppet] - https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) [11:37:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:37:50] (PS1) Ayounsi: Add Prometheus alert for router interfaces states [alerts] - https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) [11:38:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:38:24] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm) [11:39:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:39:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal@codfw [11:39:29] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [11:40:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:40:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle -
https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:41:54] (PS2) Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) [11:42:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [11:42:46] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.215s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:43:05] (PS1) Ayounsi: gNMIc: restart deamon on config file change [puppet] - https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) [11:43:30] FIRING: Emergency syslog message: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:44:24] (CR) Cathal Mooney: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) (owner: Ayounsi) [11:44:49] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal@eqiad [11:45:29] (CR) Vgutierrez: [C:+2] hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) (owner: Vgutierrez) [11:45:30] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:36] (CR) Ayounsi: [C:+2] gNMIc: restart deamon on config file change [puppet] - https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) (owner: Ayounsi) [11:45:43] (CR) Nikerabbit: AX: Add quick survey for MinT for Wikireaders (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: Abijeet Patro) [11:45:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:49] (PS2) JMeybohm: helmfile: Dump data about each service (users, namespace etc.)
to yaml [puppet] - https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) [11:47:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:46] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.865s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:48:30] RESOLVED: Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:48:48] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5060/co" [puppet] - https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm) [11:49:33] topranks: ^^ is that expected? [11:49:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:55] topranks: could be related to the BGP alerts triggered by lvs6003 reimage? [11:50:07] vgutierrez: sorry what in particular?
[11:50:19] RESOLVED: Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:50:20] that one [11:50:27] !log fio testing on ms-be2088 24 disks at once T384003 [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:50:31] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [11:50:44] vgutierrez: no that normally wouldn't happen on bgp change [11:50:46] hmm. [11:51:31] (PS2) JMeybohm: deployment_server: Remove special handling of ci user [puppet] - https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) [11:51:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:44] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:52:39] topranks: same device though [11:53:22] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10628069 (MatthewVernon) I/O definitely pauses during a controller reset (for ~20s). Going to try stressing the disks harder to see if...
[11:54:29] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:55:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:55:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:55:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal@eqiad [11:55:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:56:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:57:30] vgutierrez: yeah I suspect it's these: [11:57:31] https://logstash.wikimedia.org/goto/71a4c9e2ea26417c13677f7e6d6d362b [11:57:41] I don't think we usually see this when a BGP peer restarts though [11:57:44] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:57:46] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.901s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:58:29] topranks: is that a bgp daemon crash? 
[11:59:44] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 14.06% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:02:08] jouncebot: next [12:02:09] In 1 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [12:02:09] In 1 hour(s) and 57 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [12:02:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.464s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:57] (CR) Clément Goubert: [C:+1] switchdc: stop and restart crons as part of swithover process (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan) [12:03:31] (PS1) Vgutierrez: hiera: Fix NIC names for liberica@drmrs [puppet] - https://gerrit.wikimedia.org/r/1126972 (https://phabricator.wikimedia.org/T384477) [12:04:15] (CR) Vgutierrez: [C:+2] hiera: Fix NIC names for liberica@drmrs [puppet] - https://gerrit.wikimedia.org/r/1126972 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [12:04:37] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki
mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:06:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:36] (PS1) Jaime Nuche: test [core] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126973 [12:06:47] (PS11) Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [12:06:47] (CR) Tiziano Fogli: "I think you're right @ayounsi@wikimedia.org. I've reviewed the patchset to export the PDU data into a new netbox-hiera key (using a dedica" [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [12:07:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 4.579s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:07:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:31] SRE, Phabricator, Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10628098 (Aklapper) Thank you!
[12:09:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bookworm [12:10:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:11:44] (CR) Tiziano Fogli: "Just a reminder for myself: If this patch looks good to you, modules/profile/manifests/netbox/data.pp needs to be adjusted before merging " [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [12:12:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.553s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:14:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:25] (PS4) Hnowlan: switchdc: stop and restart crons as part of swithover process [cookbooks] - https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) [12:15:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:16:10]
(PS1) Vgutierrez: site,hiera: Reimage lvs6002 as liberica [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) [12:16:39] (CR) Volans: cumin: Add liberica aliases per DC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [12:16:45] (CR) Hnowlan: "Thanks for the reviews - moved the wait function to be invoked per-namespace rather than looping within the function, as that would return" [cookbooks] - https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan) [12:17:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.92s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:17:23] (CR) Vgutierrez: cumin: Add liberica aliases per DC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [12:17:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:36] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [12:17:55] (CR) Volans: cumin: Add liberica aliases per DC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [12:19:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:00] PROBLEM - mailman list info on
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:30] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:20:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:51] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628168 (phaultfinder) [12:21:50] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [12:22:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.791s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:22:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:23:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1034.eqiad.wmnet [12:23:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:23:40] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm -
https://phabricator.wikimedia.org/T382507#10628172 (ops-monitoring-bot) Draining ganeti1034.eqiad.wmnet of running VMs [12:24:37] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [12:24:44] (PS2) Vgutierrez: site,hiera: Reimage lvs6002 as liberica [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) [12:25:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:25:40] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [12:27:26] RESOLVED: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:27:42] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10628181 (ops-monitoring-bot) Draining ganeti1034.eqiad.wmnet of running VMs [12:27:45] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:29:42] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] -
https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm) [12:30:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 21.88% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:32:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.658s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:26] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 21.88% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:34:43] (PS1) Ladsgroup: Bump the thumbnail steps ratio to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/1126978 (https://phabricator.wikimedia.org/T360589) [12:35:24] (CR) Filippo Giunchedi: [C:+1] Add Prometheus alert for router interfaces states (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi) [12:35:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:36:12] (PS1) Máté Szabó: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) [12:36:25] (Abandoned) Jaime Nuche: test [core] (wmf/1.44.0-wmf.20) -
https://gerrit.wikimedia.org/r/1126973 (owner: Jaime Nuche) [12:36:35] (PS1) Máté Szabó: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) [12:36:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) (owner: Máté Szabó) [12:37:06] (PS1) Hashar: Remove obsoletes $wgMFNearby and $wgMFNearbyRange [mediawiki-config] - https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) [12:37:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) (owner: Máté Szabó) [12:37:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.261s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:37:26] (PS1) Máté Szabó: http: Promote MultiHttpClient warnings to errors [core] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) [12:37:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) (owner: Máté Szabó) [12:37:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM
workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:38:57] (03PS3) 10Jforrester: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 [12:39:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:11] (03PS1) 10Hashar: Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) [12:40:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 14.06% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:40:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628257 (10phaultfinder) [12:40:59] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:42:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.193s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:43:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[12:44:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [12:45:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:49:41] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:54:27] (03PS1) 10Ilias Sarantopoulos: (WIP)api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) [12:54:41] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:59:30] (03CR) 10Ssingh: [C:03+1] "Looks good, verified asw1-b13-drmrs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:01:54] !log fio testing on ms-be2088 24 disks at once whilst resetting the controller T384003 [13:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [13:02:51] (03PS1) 10Gmodena: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 [13:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:09:06] !incidents [13:09:07] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [13:09:07] 5727 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:09:07] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:09:07] (03PS1) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:10:27] (03CR) 10CI reject: [V:04-1] 
mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:11:14] (03PS2) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:12:31] (03PS2) 10Gmodena: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 [13:12:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:12:52] (03PS3) 10Ilias Sarantopoulos: api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) [13:12:59] (03PS3) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [13:14:18] (03PS4) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [13:16:25] (03CR) 10Brouberol: [C:03+1] "Approved by @Ladsgroup@gmail.com on IRC/#-sre as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:16:32] (03CR) 10Brouberol: [C:03+2] cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:17:39] FIRING: [34x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:18:01] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) [13:18:04] (03Merged) 10jenkins-bot: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:18:05] (03PS2) 10Ayounsi: Add Prometheus alert for router interfaces states [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) [13:18:07] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) [13:18:50] jouncebot: now [13:18:50] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [13:18:54] jouncebot: 
next [13:18:54] In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:18:54] In 0 hour(s) and 41 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:19:17] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:19:22] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:20:47] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:20:50] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:22:39] FIRING: [35x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:22:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:23:44] PROBLEM - Disk space on kafka-logging1004 is CRITICAL: DISK CRITICAL - free space: /srv 156388 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [13:23:56] RESOLVED: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[13:23:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [13:24:41] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:25:23] (03CR) 10Ssingh: cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [13:25:31] (03PS1) 10Effie Mouzeli: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 [13:25:49] (03PS1) 10Ladsgroup: Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 [13:27:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.81s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:27:39] RESOLVED: [35x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:27:45] RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:28:11] (03CR) 10Ayounsi: [C:03+2] Add Prometheus alert for router interfaces states (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:28:37] (03CR) 10Clément Goubert: [C:03+1] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:29:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10628439 (10MoritzMuehlenhoff) [13:29:29] (03CR) 10Marostegui: [C:03+1] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:29:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 1.562% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:29:42] (03CR) 10Clément Goubert: [C:03+1] Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:30:33] (03Merged) 10jenkins-bot: Add Prometheus alert for router interfaces states [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:30:33] (03PS1) 10Effie Mouzeli: Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 
10https://gerrit.wikimedia.org/r/1127006 [13:30:56] (03CR) 10CI reject: [V:04-1] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:31:01] (03CR) 10Effie Mouzeli: [C:03+2] Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:31:10] (03CR) 10Clément Goubert: [C:03+1] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:31:13] (03CR) 10Ladsgroup: [C:03+2] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:32:12] (03PS2) 10Effie Mouzeli: Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 [13:32:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 1.256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:32:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:34] (03Merged) 10jenkins-bot: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:32:47] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:32:52] !log upgrade doh1001 to dnsdist 1.9.8 [13:32:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:02] (03CR) 10Effie Mouzeli: [C:03+2] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:33:20] (03Merged) 10jenkins-bot: Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:33:33] !log upgrade doh2002 to dnsdist 1.9.8 [13:34:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:34:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:51] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:11] !log ladsgroup@deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:36:27] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:36:53] !incidents [13:36:53] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [13:36:54] 5727 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:36:54] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:37:29] jouncebot: next [13:37:30] In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:37:30] In 0 hour(s) and 22 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:37:51] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:37:55] Lucas_WMDE: will you be running the backport window? 
[13:38:16] I’ll be in a meeting for the first half of it so if someone else wants to run it I wouldn’t mind [13:38:27] (I’m aware something™ is going on and would ask before scapping in any case) [13:38:58] I will run scap now, ok thanks [13:39:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:40:01] !log jiji@deploy2002 Started scap sync-world: Reverted 1126607 and 1126650 [13:42:16] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 1.256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:42:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:35] (03PS1) 10Kamila Součková: admin_ng: use the correct helm version for each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127011 (https://phabricator.wikimedia.org/T388390) [13:43:40] (03CR) 10Klausman: [C:03+1] api_gateway: add editcheck experimental to api-gw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [13:43:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:44:19] !log jiji@deploy2002 Finished scap sync-world: Reverted 1126607 and 1126650 (duration: 04m 57s) [13:44:24] 
(03PS1) 10Muehlenhoff: Switch ganeti1034 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127012 [13:44:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 22.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:44:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:24] Lucas_WMDE, urbanecm, TheresNoTime whoever is to run the backport window, please check with -sre before doing so [13:45:31] ack [13:47:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 4.977s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:48:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:48:55] (ack) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400). [14:00:05] JSherman, zip, tgr, Lucas_WMDE, Superpes, and mszabo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [14:00:10] moin [14:00:17] here [14:00:21] :) [14:00:21] standing by [14:00:42] I’m in a meeting for the next 30 minutes, so if someone else wants to start deploying… [14:00:45] my change is just config; happy to self deploy [14:00:46] o/ [14:00:47] (also note effie’s comment above) [14:00:55] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:17] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:20] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:49] thanks tgr_: ! [14:01:57] I'm busy for this window, but just to repeat effie's message for those who may have joined after it was sent — "whoever is to run the backport window, please check with -sre before doing so" [14:02:09] I can deploy if we are good to go [14:02:39] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:02:45] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:05:10] effie: Does that also apply to service deployments? 
[14:06:53] IIUC volans is the right person to answer ^ that question now [14:07:04] since IC’ship was handed over [14:07:12] (03CR) 10Btullis: [C:03+1] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:07:15] (03PS1) 10Gmodena: Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 [14:07:17] I've answered tgr in -sre [14:07:35] I think we're good to go, we declared the incident resolved and also the status page was cleared [14:07:46] and no one else said otherwise :) [14:07:50] Ack. [14:07:52] (03CR) 10Stevemunene: [C:03+2] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:08:02] (03CR) 10Stevemunene: [V:03+2 C:03+2] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:08:21] (03CR) 10Btullis: [C:03+1] hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:09:32] thanks, I'll start then [14:10:19] (03PS1) 10Brouberol: mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) [14:10:53] I'll batch the non-scary looking config patches [14:10:59] (i.e. not Flow) [14:11:11] hehe [14:11:17] thanks! [14:11:19] Lucas_WMDE: can the backports go in one scap?
[14:11:34] yeah, but I’d like to test them, so let’s see until I’m out of my meeting [14:11:39] but yeah I was planning on one scap for all four [14:11:50] ack [14:11:59] (03CR) 10Brouberol: [C:03+1] Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 (owner: 10Gmodena) [14:12:19] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:12:34] I'll be in a meeting from :30 so happy to hand over then [14:13:33] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) (owner: 10Jforrester) [14:13:48] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:13:53] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:13:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [14:13:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:14:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) (owner: 10Superpes15) [14:14:48] (03Merged) 10jenkins-bot: Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) 
[14:14:50] (03Merged) 10jenkins-bot: Enable SUL3 signup for 50% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:14:53] (03Merged) 10jenkins-bot: [enwiki] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) (owner: 10Superpes15) [14:15:02] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) (owner: 10Jforrester) [14:15:26] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] [14:15:33] T382147: Configure a metrics platform stream with a custom schema to record how Nuke users filter pages to delete - https://phabricator.wikimedia.org/T382147 [14:15:33] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [14:15:33] T388637: Lift of IP Cap for Event: 194.80.232.21 - https://phabricator.wikimedia.org/T388637 [14:15:35] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) [14:15:37] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:15:56] hey, was mine supposed to be in there? [14:16:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:16:35] no, that seemed more risky [14:16:35] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:47] ah, okay [14:17:05] oh, sorry, missed the line when you said you'd batch everything except Flow :D [14:17:17] (03PS13) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [14:17:29] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:17:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:17:58] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:18:48] !log tgr@deploy2002 jsn, tgr, superpes: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:18:49] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:18:50] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1187-1199].eqiad.wmnet [14:19:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:19:10] JSherman: Superpes: these patches aren't really testable, right? [14:19:22] tgr_ Exactly :) [14:19:27] Correct [14:19:41] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:19:56] !log tgr@deploy2002 jsn, tgr, superpes: Continuing with sync [14:20:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:20:28] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:20:54] (CR) Jforrester: [C:+2] wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) (owner: Jforrester)
[14:21:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:22:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:22:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:22:19] (Merged) jenkins-bot: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) (owner: Jforrester)
[14:22:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:23:13] <_joe_> jouncebot: next
[14:23:14] In 2 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1700)
[14:23:18] <_joe_> jouncebot: now
[14:23:18] For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400)
[14:23:18] For the next 0 hour(s) and 36 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400)
[14:23:29] Overlapping windows, what fun.
[14:23:43] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:24:06] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:24:13] should be fine in this case, right?
[14:24:30] <_joe_> tgr_: for now, yes
[14:24:50] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:25:16] (PS14) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[14:25:25] (Abandoned) Elukey: WIP: sre.hosts.provision: add bios-mode-flip for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1123381 (owner: Elukey)
[14:25:34] Yup, I'm not worried about cross-talk.
[14:26:09] !log depooling lvs6002 before getting reimaged - T384477
[14:26:10] But we shouldn't really pin the SF morning window to European summer time shift, I suppose? Yay daylight confusion time.
[14:26:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:12] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[14:26:15] (PS1) Elukey: services: Update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1127023 (https://phabricator.wikimedia.org/T386926)
[14:26:18] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:26:31] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] (duration: 11m 04s)
[14:26:36] T382147: Configure a metrics platform stream with a custom schema to record how Nuke users filter pages to delete - https://phabricator.wikimedia.org/T382147
[14:26:36] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218
[14:26:36] T388637: Lift of IP Cap for Event: 194.80.232.21 - https://phabricator.wikimedia.org/T388637
[14:26:47] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:26:53] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs6002.drmrs.wmnet with reason: depooled before reimage
[14:27:15] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) (owner: Zoe)
[14:27:58] (We're now done anyway.)
[14:27:59] (Merged) jenkins-bot: Remove Flow as the default talk system [mediawiki-config] - https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) (owner: Zoe)
[14:28:28] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]]
[14:28:31] T383569: Set DiscussionTools as default talk pages system at Phase 2b wikis - https://phabricator.wikimedia.org/T383569
[14:28:50] (CR) Elukey: [C:+2] services: Update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1127023 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[14:29:02] (PS1) Brouberol: Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378)
[14:29:20] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:29:40] (CR) Gergő Tisza: [C:+2] Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126949 (owner: Lucas Werkmeister (WMDE))
[14:29:41] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:29:42] (CR) Btullis: [C:+1] Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[14:29:51] (CR) Gergő Tisza: [C:+2] Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE))
[14:29:59] (CR) Gergő Tisza: [C:+2] Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126951 (owner: Lucas Werkmeister (WMDE))
[14:30:08] (CR) Gergő Tisza: [C:+2] Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE))
[14:30:19] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:30:45] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6002 as liberica [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[14:31:27] !log tgr@deploy2002 zoe, tgr: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:31:34] meeting done yay
[14:31:35] (Merged) jenkins-bot: Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126949 (owner: Lucas Werkmeister (WMDE))
[14:31:37] (Merged) jenkins-bot: Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE))
[14:31:38] (Merged) jenkins-bot: Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126951 (owner: Lucas Werkmeister (WMDE))
[14:32:10] (Merged) jenkins-bot: Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE))
[14:32:58] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6002.drmrs.wmnet with OS bookworm
[14:33:19] Okay, that's looking good on the four wikis that were defaulting to Flow
[14:33:19] zip: do you want to test it?
[14:33:26] cool, thanks
[14:33:33] !log tgr@deploy2002 zoe, tgr: Continuing with sync
[14:34:26] (CR) Brouberol: [C:+2] Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[14:34:34] SRE, observability, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680 (ssingh) NEW
[14:34:37] (PS1) Filippo Giunchedi: pontoon: fix puppet client link in pontoon puppetserver [puppet] - https://gerrit.wikimedia.org/r/1127027
[14:34:55] SRE, observability, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628842 (ssingh) p:Triage→Medium
[14:36:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:36:33] SRE, SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681 (Jdforrester-WMF) NEW
[14:37:05] (CR) Filippo Giunchedi: [C:+2] pontoon: fix puppet client link in pontoon puppetserver [puppet] - https://gerrit.wikimedia.org/r/1127027 (owner: Filippo Giunchedi)
[14:37:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:38:39] (CR) Btullis: [C:+1] "Looks good. I would probably add a handful of hosts at a time, rather than all 22 at once. You can just disable puppet and re-enable it in" [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) (owner: Stevemunene)
[14:39:31] (PS1) Filippo Giunchedi: prometheus: cleanup instance functionality [puppet] - https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232)
[14:39:34] (PS1) Filippo Giunchedi: hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232)
[14:39:45] SRE, SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10628884 (DSantamaria) Approved!
[14:40:00] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]] (duration: 11m 32s)
[14:40:03] (CR) Stevemunene: "Ack, will do 5 at a time. Thanks Ben" [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) (owner: Stevemunene)
[14:40:03] T383569: Set DiscussionTools as default talk pages system at Phase 2b wikis - https://phabricator.wikimedia.org/T383569
[14:40:39] Lucas_WMDE: over to you
[14:40:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:44] ok, thanks!
[14:42:25] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]]
[14:42:29] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079
[14:43:04] we just had some pybal and BGP alerts for drmrs. Are those expected?
[14:43:18] yep, vgutierrez is reimaging
[14:43:23] for liberica
[14:43:30] thanks, scroll is terrible, even if I searched
[14:43:35] sorry for the noise
[14:43:40] np
[14:44:00] (CR) Volans: [C:-1] "One minor but easily confusing bug, couple of minor comments inline. LGTM otherwise." [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey)
[14:44:00] (PS1) Brouberol: mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378)
[14:44:51] (CR) Btullis: [C:+1] mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[14:45:13] (PS2) Brouberol: mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378)
[14:45:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:45:46] testing, one moment…
[14:47:34] looking good so far, still testing https://www.wikidata.org/wiki/Special:ConstraintReport/Q4115189
[14:47:40] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[14:48:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[14:48:49] seems to be all working \o/
[14:49:49] (PS1) Ssingh: wikidough: add healthcheck override for doh1001 and doh2002 [puppet] - https://gerrit.wikimedia.org/r/1127039
[14:49:50] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6002.drmrs.wmnet with reason: host reimage
[14:50:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:50:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:50:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:51:03] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1127039 (owner: Ssingh)
[14:52:52] (PS1) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[14:53:15] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6002.drmrs.wmnet with reason: host reimage
[14:53:33] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_main@codfw
[14:53:38] (CR) Vgutierrez: [C:+2] hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: Vgutierrez)
[14:54:04] (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[14:54:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628917 (phaultfinder)