[00:00:06] (03CR) 10Aaron Schulz: [C:03+2] Update Docker images of staging changeprop services to ones using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [00:00:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:01:36] (03Merged) 10jenkins-bot: Update Docker images of staging changeprop services to ones using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [00:01:49] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:02:59] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:07:24] !log aaron@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [00:07:43] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:08:42] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621677 (10phaultfinder) [00:13:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621703 (10phaultfinder) [00:18:53] !log aaron@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [00:21:57] !log aaron@deploy2002 
helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [00:22:48] !log aaron@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [00:40:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126199 [00:40:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126199 (owner: 10TrainBranchBot) [00:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621740 (10phaultfinder) [00:52:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126199 (owner: 10TrainBranchBot) [01:08:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126203 [01:08:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126203 (owner: 10TrainBranchBot) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0200) [02:00:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_risk_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [02:05:18] !incidents [02:05:18] 5720 (UNACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_risk_cluster api-gateway eqiad) [02:05:19] 5719 (RESOLVED) [2x] GatewayBackendErrorsHigh sre (api-gateway eqiad) [02:05:19] 5717 (RESOLVED) 
db1152 (paged)/MariaDB read only ms1 (paged) [02:05:26] !ack 5720 [02:05:27] 5720 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_risk_cluster api-gateway eqiad) [02:05:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_risk_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [02:08:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.20 [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126207 (https://phabricator.wikimedia.org/T386215) [02:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.20 [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126207 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [02:13:06] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126203 (owner: 10TrainBranchBot) [02:19:51] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.20 [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126207 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0300) [03:02:14] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126208 (https://phabricator.wikimedia.org/T386215) [03:02:15] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126208 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) 
[03:03:03] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126208 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [03:03:27] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.20 refs T386215 [03:03:31] T386215: 1.44.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T386215 [03:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:23:23] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:39:05] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:44:07] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [03:52:40] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.20 refs T386215 (duration: 49m 13s) [03:52:51] T386215: 1.44.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T386215 [03:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:59:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:00:06] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers 
(except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0400) [04:03:04] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.17 (duration: 03m 02s) [04:13:23] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10622076 (10phaultfinder) [04:59:27] (03PS1) 10Aaron Schulz: services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) [04:59:29] (03PS1) 10Aaron Schulz: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) [04:59:30] (03PS1) 10Aaron Schulz: services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) [05:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 22.22% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:38:48] (03PS1) 10Abijeet Patro: EventLogging: Improve handling when suggestions are not present [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) [05:39:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [05:39:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [05:39:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0600) [06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time 
for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0600) [06:36:47] (03CR) 10Marostegui: [C:03+2] mariadb: remove RT GRANTs for m1 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126121 (https://phabricator.wikimedia.org/T388437) (owner: 10Dzahn) [06:44:59] !log Remove rt grants from m1 T388437 [06:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:02] T388437: drop RT database and mysql grants - https://phabricator.wikimedia.org/T388437 [06:45:41] !log Drop rt database from m1 T388437 [06:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:40] (03PS1) 10Marostegui: dbproxy1023,dbproxy1025: Test db1164 [puppet] - 10https://gerrit.wikimedia.org/r/1126413 (https://phabricator.wikimedia.org/T388396) [06:52:17] (03CR) 10Marostegui: [C:03+2] dbproxy1023,dbproxy1025: Test db1164 [puppet] - 10https://gerrit.wikimedia.org/r/1126413 (https://phabricator.wikimedia.org/T388396) (owner: 10Marostegui) [06:53:23] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:55:33] (03PS1) 10Marostegui: Revert "dbproxy1023,dbproxy1025: Test db1164" [puppet] - 10https://gerrit.wikimedia.org/r/1126419 [06:57:10] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy1023,dbproxy1025: Test db1164" [puppet] - 10https://gerrit.wikimedia.org/r/1126419 (owner: 10Marostegui) [07:00:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2233].codfw.wmnet,db[1164,1217,1228].eqiad.wmnet with reason: Primary switchover m2 T388396 [07:00:14] T388396: Switchover m2 master db1228 -> db1164 - https://phabricator.wikimedia.org/T388396 [07:01:24] 88 [07:02:09] (03PS1) 10Marostegui: mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1126424 (https://phabricator.wikimedia.org/T388396) [07:02:57] 
(03PS1) 10Filippo Giunchedi: sqlite: require sqlite::package in 'file' db resource [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) [07:03:15] (03PS2) 10Marostegui: mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1126424 (https://phabricator.wikimedia.org/T388396) [07:13:03] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1126424 (https://phabricator.wikimedia.org/T388396) (owner: 10Marostegui) [07:13:42] !log Failover m2 from db1228 to db1164 - T388396 [07:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:45] T388396: Switchover m2 master db1228 -> db1164 - https://phabricator.wikimedia.org/T388396 [07:17:21] (03CR) 10Vgutierrez: [C:04-1] "you can make this script work independently of volatile directories, there is no need of depending on that" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [07:19:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Maintenance [07:19:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1228.eqiad.wmnet [07:23:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [07:23:23] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:23:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1228.eqiad.wmnet [07:25:47] (03PS1) 10Marostegui: mariadb: Move db1228 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/1126460 (https://phabricator.wikimedia.org/T388496) [07:26:10] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Remove stale vops-bot-sync-db* service [puppet] - 10https://gerrit.wikimedia.org/r/1126128 (https://phabricator.wikimedia.org/T388444) (owner: 10Andrea Denisse) [07:33:24] (03PS1) 10Jon Harald Søby: Add uca collation for Kazakh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126483 (https://phabricator.wikimedia.org/T384395) [07:33:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126483 (https://phabricator.wikimedia.org/T384395) (owner: 10Jon Harald Søby) [07:48:26] (03PS1) 10Filippo Giunchedi: icinga: route relforge icinga alerts to data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1126486 (https://phabricator.wikimedia.org/T388270) [07:54:50] (03PS1) 10Slyngshede: data.yaml temporaily remove SSH key for user [puppet] - 10https://gerrit.wikimedia.org/r/1126487 [07:59:01] (03CR) 10Slyngshede: "I'll ping the user and help them get a replacement key added." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126487 (owner: 10Slyngshede) [07:59:26] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1228 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/1126460 (https://phabricator.wikimedia.org/T388496) (owner: 10Marostegui) [07:59:30] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [07:59:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Maintenance [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T0800). [08:00:05] abijeet and Jhs: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance [08:00:38] 👋 [08:00:42] I can deploy abijeet's patch [08:01:02] hello hello, apologies for being late. [08:01:17] No problem. 
It is just a few seconds ;) [08:02:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: Cloning [08:03:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [08:03:22] I should have +2 before :/ [08:04:10] (03CR) 10Jelto: [C:03+1] "one question in line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [08:04:21] kart_, since my patch is also language-related, you could count that as work too 😜 [08:04:29] (03Merged) 10jenkins-bot: EventLogging: Improve handling when suggestions are not present [extensions/Translate] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126220 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [08:04:38] Jhs, :-D [08:05:28] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1126220|EventLogging: Improve handling when suggestions are not present (T388467)]] [08:05:32] T388467: TypeError: mtSuggestions.map is not a function - https://phabricator.wikimedia.org/T388467 [08:07:53] (03CR) 10Jelto: [C:03+1] "lgtm. There are some more `In the past it was ...` comments. Why did you remove some and not all of them?" [puppet] - 10https://gerrit.wikimedia.org/r/1126105 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [08:08:03] That's a fast merge! [08:08:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126487 (owner: 10Slyngshede) [08:08:45] !log installing systemd bugfix updates from Bookworm point release [08:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:06] Jhs: sure [08:10:30] kart_, i was half kidding, you don't need to if you have other stuff to do. 
:) My patch also needs a script run in addition to the merge, so it's not necessarily straight-forward [08:12:03] Jhs: can you add the instruction in the deployment page as well? That will be helpful. [08:12:30] !log kartik@deploy2002 abi, kartik: Backport for [[gerrit:1126220|EventLogging: Improve handling when suggestions are not present (T388467)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:12:34] T388467: TypeError: mtSuggestions.map is not a function - https://phabricator.wikimedia.org/T388467 [08:13:14] abijeet: You can test the patch now.. [08:13:23] kart_, yup, i did just a couple minutes ago: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2281410&oldid=2281402 [08:14:25] kart_, ok [08:15:52] Jhs: thanks. Let's try that :) I've not run script since a long though. [08:17:48] kart_, updateCollation has a --dry-run option if you wish to do that one first: https://www.mediawiki.org/wiki/Manual:UpdateCollation.php [08:18:06] sure [08:18:06] This script is normally very fast for small wikis (kkwiki is a small wiki in this context) [08:18:39] abijeet: is all good? [08:20:10] kart_, the error is also present in wmf.19, might have to backport there. [08:22:19] (03PS1) 10Abijeet Patro: EventLogging: Improve handling when suggestions are not present [extensions/Translate] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126489 (https://phabricator.wikimedia.org/T388467) [08:22:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126489 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [08:22:48] abijeet: but should we go ahead with this patch? 
[08:23:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:23:11] kart_, yea, lets do that [08:23:15] cool [08:23:18] !log kartik@deploy2002 abi, kartik: Continuing with sync [08:24:08] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10622422 (10MoritzMuehlenhoff) @AStein-WMF Turns out we don't actually need your SSH key for this access request. Instead, ple... [08:28:27] (03PS1) 10Muehlenhoff: Drop lvs-eqsin alias [puppet] - 10https://gerrit.wikimedia.org/r/1126490 [08:30:51] (03CR) 10Slyngshede: [C:03+2] data.yaml temporaily remove SSH key for user [puppet] - 10https://gerrit.wikimedia.org/r/1126487 (owner: 10Slyngshede) [08:32:07] (03CR) 10Volans: [C:03+2] query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 (owner: 10Volans) [08:32:20] (03CR) 10Volans: [C:03+2] "Actually we need to wait to migrate to trixie+ to get rid of this." [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 (owner: 10Volans) [08:32:24] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126220|EventLogging: Improve handling when suggestions are not present (T388467)]] (duration: 26m 56s) [08:32:28] T388467: TypeError: mtSuggestions.map is not a function - https://phabricator.wikimedia.org/T388467 [08:33:29] abijeet: I'll do deployment of 2nd patch after Jhs's patch. [08:34:00] Jhs: deploying your patch now.. 
[08:34:07] kart_, 👍 [08:34:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126483 (https://phabricator.wikimedia.org/T384395) (owner: 10Jon Harald Søby) [08:35:06] (03Merged) 10jenkins-bot: Add uca collation for Kazakh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126483 (https://phabricator.wikimedia.org/T384395) (owner: 10Jon Harald Søby) [08:35:37] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] [08:35:40] T384395: Adding Uppercase and lowercase collation for Kazakh language - https://phabricator.wikimedia.org/T384395 [08:36:59] (03CR) 10Volans: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 (owner: 10Volans) [08:37:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:37:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:37:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:37:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:38:37] !log kartik@deploy2002 kartik, jhsoby: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:39:22] Jhs: we can't test on testservers until script is run, right? [08:40:47] kart_, correct, the changes most likely won't be visible until the script is run [08:41:26] ok. going ahead. 
[08:41:29] !log kartik@deploy2002 kartik, jhsoby: Continuing with sync [08:42:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:43:23] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:43:38] (03PS1) 10Slyngshede: Revert^2 "P:firewall absent conntrack_table_size monitoring." [puppet] - 10https://gerrit.wikimedia.org/r/1126492 [08:46:54] (03PS4) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) [08:47:28] (03CR) 10Slyngshede: [C:03+2] Revert^2 "P:firewall absent conntrack_table_size monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126492 (owner: 10Slyngshede) [08:47:50] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] (duration: 12m 13s) [08:47:54] T384395: Adding Uppercase and lowercase collation for Kazakh language - https://phabricator.wikimedia.org/T384395 [08:48:01] (03Merged) 10jenkins-bot: query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 (owner: 10Volans) [08:48:30] (03CR) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [08:48:54] (03CR) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [08:49:04] (03Merged) 10jenkins-bot: docs: removed deprecated call to sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 (owner: 10Volans) [08:50:13] Jhs: deployment done. Running with --dry-run first. [08:50:38] nice! 
[08:50:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [08:50:51] Now running actual one :) [08:51:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10622493 (10ops-monitoring-bot) Draining ganeti1037.eqiad.wmnet of running VMs [08:51:17] (03PS2) 10Volans: tests: remove unnecessary vulture setting [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125956 [08:51:45] (03PS1) 10Muehlenhoff: Switch ganeti1037 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1126494 [08:51:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Maintenance [08:52:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [08:53:11] abijeet: Deploying 2nd patch in a few minutes.. [08:56:22] kart_, ok [08:57:33] (03PS1) 10Volans: cookbook: make the default argument parser tunable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1126498 [08:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10622526 (10phaultfinder) [08:59:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [09:00:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10622527 (10ops-monitoring-bot) Draining ganeti1037.eqiad.wmnet of running VMs [09:02:02] (03PS1) 10Slyngshede: Revert^3 "P:firewall absent conntrack_table_size monitoring." [puppet] - 10https://gerrit.wikimedia.org/r/1126500 [09:02:06] (03CR) 10Federico Ceratto: "Setting 2 comments as resolved." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [09:03:23] (03CR) 10Volans: "Given many cookbooks add a reason and/or a task CLI argument, I've made this proposal to simplify its addition, allowing also to choose if" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1126498 (owner: 10Volans) [09:04:46] (03CR) 10Slyngshede: [C:03+2] Revert^3 "P:firewall absent conntrack_table_size monitoring." [puppet] - 10https://gerrit.wikimedia.org/r/1126500 (owner: 10Slyngshede) [09:06:31] (03CR) 10Federico Ceratto: "Added CLI option for externalLoads" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [09:06:43] Jhs: script still running.. [09:07:05] (03CR) 10Federico Ceratto: Ask for confirmation before depooling last host in a group (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [09:08:06] kart_, ok, thanks for letting me know (and sorry it's taking so long, normally it's quicker) [09:10:57] (03PS5) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) [09:13:24] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:14:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi) [09:18:32] (03CR) 10Fabfur: [C:03+1] site,hiera: Reimage lvs6003 as liberica [puppet] - 
10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:20:23] (03CR) 10Fabfur: "[question] so now to target liberica lbs in single DCs there is no "direct" alias?" [puppet] - 10https://gerrit.wikimedia.org/r/1125162 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:20:32] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1125162 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:20:40] (03CR) 10Jelto: [C:03+1] "lgtm now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [09:21:05] (03PS1) 10Slyngshede: P:firewall absent conntrack_table_size monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/1126503 (https://phabricator.wikimedia.org/T350694) [09:21:08] (03CR) 10Vgutierrez: [C:03+2] "it will be added on a following CR" [puppet] - 10https://gerrit.wikimedia.org/r/1125162 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:22:11] (03CR) 10Muehlenhoff: "We can also add them at this point, when the liberica alias was added, there was just the experimental nodes on the old eqiad hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125162 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:22:36] (03CR) 10Vgutierrez: "duplicated, see Idc4403b23c1121053edc6da9a96a6b50650ed3ff" [puppet] - 10https://gerrit.wikimedia.org/r/1126490 (owner: 10Muehlenhoff) [09:24:03] (03PS2) 10JMeybohm: Update wikikube-staging codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1126105 (https://phabricator.wikimedia.org/T386232) [09:24:05] (03CR) 10Muehlenhoff: "All great minds think alike :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1126490 (owner: 10Muehlenhoff) [09:24:08] (03Abandoned) 10Muehlenhoff: Drop lvs-eqsin alias [puppet] - 10https://gerrit.wikimedia.org/r/1126490 (owner: 10Muehlenhoff) [09:24:15] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: fix typo in rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126076 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:24:44] (03CR) 10Jelto: [C:03+1] "lgtm now!" [puppet] - 10https://gerrit.wikimedia.org/r/1126105 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [09:24:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1125529 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [09:25:14] Not sure why it is taking such long time, Jhs [09:25:47] sorry about that :/ [09:26:58] (03CR) 10Filippo Giunchedi: [C:03+1] P:firewall absent conntrack_table_size monitoring. 
[puppet] - 10https://gerrit.wikimedia.org/r/1126503 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:27:18] dry-run was super fast :D [09:27:28] (03CR) 10Filippo Giunchedi: [C:03+2] sqlite: require sqlite::package in 'file' db resource [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi) [09:28:42] (03PS1) 10JMeybohm: admin_ng: Create cert-manager leases in cert-manager namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125462 (https://phabricator.wikimedia.org/T383553) [09:28:46] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-For-Review: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112#10622569 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Optimistically resolving, I'll report back if I see this again [09:28:46] kart_, sounds like it got stuck somehow (i'm not seeing the expected changes either). Maybe try to exit the script and start again (if that's safe)? [09:28:59] !log jelto@cumin1002 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster staging-codfw: Kubernetes upgrade [09:29:17] It is still running like, [09:29:17] `Selecting next 100 pages from cl_from = 435300... processing... 1559258 done. [09:29:17] Selecting next 100 pages from cl_from = 435400... processing... 1559758 done.` [09:29:37] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:39] aha [09:29:52] kkwiki has so many pages? [09:29:54] kart_, was the change deployed, or is it still on wmdebug? [09:30:05] Jhs: change is deployed. [09:30:06] ~250,000 according to Special:Statistics [09:30:38] sorry, that's articles. ~650,000 pages [09:31:29] let's wait for sometime.. 
[09:32:05] (03CR) 10JMeybohm: [C:03+2] admin_ng: Create cert-manager leases in cert-manager namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125462 (https://phabricator.wikimedia.org/T383553) (owner: 10JMeybohm) [09:32:19] (03CR) 10JMeybohm: [C:03+2] admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [09:32:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:32:23] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [09:32:28] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [09:32:33] (03CR) 10JMeybohm: [C:03+2] Update wikikube-staging codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1126105 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [09:32:35] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:33:40] ^ this pybal alert is expected because of maintenance in staging-codfw wikikube cluster - T384450 [09:33:40] T384450: Update wikikube-staging-codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T384450 [09:34:05] (03CR) 10Muehlenhoff: "Looks good, two comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [09:34:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10622599 (10phaultfinder) [09:35:22] kart_, oh, something happened; the 
category I'm using to check suddenly got updated (all at once): https://kk.wikipedia.org/wiki/%D0%A1%D0%B0%D0%BD%D0%B0%D1%82:%D2%9A%D0%B0%D0%B7%D0%B0%D2%9B_%D0%B6%D0%B0%D0%B7%D1%83%D1%8B [09:35:29] So yeah, let's just leave it to finish even if it takes time :) [09:37:06] (03Merged) 10jenkins-bot: admin_ng: Create cert-manager leases in cert-manager namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125462 (https://phabricator.wikimedia.org/T383553) (owner: 10JMeybohm) [09:37:07] (03Merged) 10jenkins-bot: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) (owner: 10JMeybohm) [09:37:22] sure [09:38:04] (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126505 (https://phabricator.wikimedia.org/T388378) [09:38:27] (03CR) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [09:38:50] (03PS10) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [09:39:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Translate] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126489 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [09:40:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:40:57] Jhs: finally done! 
1983673 rows processed [09:41:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:41:15] !log Script run: `mwscript updateCollation.php --wiki=kkwiki --previous-collation=uppercase` (T384395) [09:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:18] T384395: Adding Uppercase and lowercase collation for Kazakh language - https://phabricator.wikimedia.org/T384395 [09:41:53] Thanks Jhs. This was a bit new experience for me :) [09:42:22] wohoo! Thank you very much, kart_! The categories I've used for checking look correct (they didn't before this change), so I think it's all good :) [09:42:34] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [09:42:55] Nice! 
[09:44:35] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2003.codfw.wmnet, kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:46:21] (03PS11) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [09:46:38] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: add missing cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126505 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:47:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:48:42] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:48:48] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:50:53] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [09:51:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:52:59] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.liberica-admin depooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:52:59] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.liberica-admin (exit_code=1) depooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:54:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [09:54:33] (03CR) 10Giuseppe Lavagetto: [C:04-1] "A couple of things here:" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [09:55:01] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:55:03] (03CR) 10Giuseppe Lavagetto: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [09:55:03] (03CR) 10Giuseppe Lavagetto: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [09:55:07] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:55:52] (03PS5) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) [09:57:40] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:57:52] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [09:58:25] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:58:37] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: add missing cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126505 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:59:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1000) [10:01:14] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:01:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10622706 (10MoritzMuehlenhoff) [10:02:25] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:43] (03PS3) 10Slyngshede: Permissions LDAP group validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 [10:05:50] (03PS1) 10David Caro: tools-legacy-redirector: use a custom event_mpm config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:05:51] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster staging-codfw: Kubernetes upgrade [10:05:58] (03CR) 10Slyngshede: Permissions LDAP group validator (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [10:06:02] (03PS7) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [10:06:03] (03PS7) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:06:03] (03PS6) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [10:06:04] (03PS7) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [10:06:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10622716 (10elukey) I think that we have something working! Starting point: ` PD LIST : ======= -------------------------------------... 
[10:06:20] !log jelto@cumin1002 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster staging-codfw: Kubernetes upgrade [10:06:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one final typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [10:07:31] (03CR) 10CI reject: [V:04-1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [10:07:38] (03CR) 10CI reject: [V:04-1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [10:07:40] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [10:07:54] (03PS1) 10JMeybohm: admin_ng: Fix dependencies of istio-gateways-networkpolicies release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126512 (https://phabricator.wikimedia.org/T341984) [10:08:00] (03CR) 10CI reject: [V:04-1] tools-legacy-redirector: use a custom event_mpm config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:08:31] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126512 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:08:33] (03CR) 10Kamila Součková: [C:03+1] admin_ng: Fix dependencies of istio-gateways-networkpolicies release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126512 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:09:26] (03PS2) 10David Caro: tools-legacy-redirector: use a custom event_mpm config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:11:34] (03CR) 10Elukey: "@fceratto@wikimedia.org I see that the code review has already been reviewed by Scott (that has way more context than me) but I have a few" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [10:11:39] (03CR) 10CI reject: [V:04-1] tools-legacy-redirector: use a custom event_mpm config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:12:27] (03PS3) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:12:45] (03CR) 10Giuseppe Lavagetto: mediawiki: introduce feature flags (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [10:13:44] (03CR) 10JMeybohm: [V:03+2 C:03+2] admin_ng: Fix dependencies of istio-gateways-networkpolicies release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126512 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:14:40] (03CR) 10CI reject: [V:04-1] tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:14:48] !log dcausse@deploy2002 Started deploy [airflow-dags/search@c27621d]: publish search artifacts [10:15:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:15:16] (03PS4) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:15:18] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@c27621d]: publish search artifacts (duration: 00m 29s) [10:15:48] 06SRE, 06Infrastructure-Foundations: 
Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10622774 (10MoritzMuehlenhoff) [10:17:00] 10SRE-swift-storage, 06Commons, 06serviceops: Commons thumbnails are broken for certain large sizes of thumbnail images - https://phabricator.wikimedia.org/T358738#10622775 (10jijiki) 05Open→03Resolved a:03jijiki [10:17:14] jouncebot: nowandnext [10:17:14] For the next 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1000) [10:17:14] In 1 hour(s) and 42 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1200) [10:17:26] (03CR) 10CI reject: [V:04-1] tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:18:42] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:19:01] (03PS5) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10622781 (10phaultfinder) [10:20:13] 06SRE, 06Infrastructure-Foundations, 10Packaging: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#10622795 (10jijiki) [10:20:44] (03CR) 10Vgutierrez: haproxy: use TLS tmpfiles and add certificate check script (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [10:21:27] (03PS8) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:21:28] (03PS7) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [10:21:28] 
(03PS8) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [10:21:41] !log Deploy schema change on s4 testcommonswiki codfw master with replication dbmaint T385917 [10:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:44] (03CR) 10David Caro: tools-legacy-redirector: use a custom mpm_event config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:21:44] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [10:21:48] (03CR) 10David Caro: [C:04-1] tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:22:30] (03PS6) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:22:43] !log Deploy schema change on x1 commonswiki codfw master with replication dbmaint T385917 [10:22:43] (03CR) 10Cathal Mooney: [C:03+2] Add new switches eqiad racks E8/F8 [homer/public] - 10https://gerrit.wikimedia.org/r/1126136 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [10:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:56] (03CR) 10CI reject: [V:04-1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [10:22:56] (03CR) 10CI reject: [V:04-1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [10:23:00] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [10:23:05] 
(03CR) 10Muehlenhoff: "Let's not introduce random new insetup roles, we already have insetup roles for every team, so one of them should be used instead" [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [10:23:16] (03Merged) 10jenkins-bot: Add new switches eqiad racks E8/F8 [homer/public] - 10https://gerrit.wikimedia.org/r/1126136 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [10:25:19] (03PS7) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:25:30] (03PS1) 10Filippo Giunchedi: o11y: fix PrometheusLowRetention expression [alerts] - 10https://gerrit.wikimedia.org/r/1126513 (https://phabricator.wikimedia.org/T388504) [10:26:09] (03CR) 10David Caro: tools-legacy-redirector: use a custom mpm_event config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:27:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10622824 (10Ladsgroup) Progress update: Now 1% of production thumbnails (and 100% of test wikis) are using thumbnail steps which means they are going to b... [10:28:03] (03CR) 10Tiziano Fogli: [C:03+1] o11y: fix PrometheusLowRetention expression [alerts] - 10https://gerrit.wikimedia.org/r/1126513 (https://phabricator.wikimedia.org/T388504) (owner: 10Filippo Giunchedi) [10:28:17] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10622826 (10Ladsgroup) a:03Ladsgroup I guess I'm doing this. 
[10:28:56] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: fix PrometheusLowRetention expression [alerts] - 10https://gerrit.wikimedia.org/r/1126513 (https://phabricator.wikimedia.org/T388504) (owner: 10Filippo Giunchedi) [10:29:59] (03PS1) 10Ladsgroup: FileModule: Normalize file paths for deps tracked from CSSMin [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126515 (https://phabricator.wikimedia.org/T388323) [10:30:15] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1253 gradually with 4 steps - Pool in for T385141 [10:30:19] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [10:30:58] (03CR) 10David Caro: [V:03+1] "Deployed and tested in toolsbeta and tools 👍" [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:31:50] (03CR) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [10:31:59] (03PS1) 10Ladsgroup: Bump thumbnail steps to 2% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126516 (https://phabricator.wikimedia.org/T360589) [10:32:21] (03PS12) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [10:33:14] (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [10:34:34] (03PS13) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [10:35:46] (03PS1) 10Fabfur: cache,haproxy: use parametrized tmpfiles cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1126517 [10:36:16] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:36:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126084 (https://phabricator.wikimedia.org/T382069) (owner: 10Jforrester) [10:36:32] (03CR) 10FNegri: [C:03+1] tools-legacy-redirector: use a custom mpm_event config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:37:12] (03CR) 10Lucas Werkmeister (WMDE): "No occurrences of those settings left in wmf.19/wmf.20:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [10:37:13] (03Merged) 10jenkins-bot: Stop loading the ActiveAbstract extension for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126084 (https://phabricator.wikimedia.org/T382069) (owner: 10Jforrester) [10:37:34] (03PS14) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [10:37:41] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] [10:37:44] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069 [10:40:02] (03PS8) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:40:02] (03CR) 10David Caro: tools-legacy-redirector: use a custom mpm_event config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) (owner: 10David Caro) [10:40:11] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM. I also double-checked the prefixes in netbox." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1126035 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [10:40:49] (03PS9) 10David Caro: tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) [10:41:19] !log installing openjdk 17 security updates on puppet servers (the necessary restarts may cause a few interrupted puppet runs and will be splayed out) [10:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:39] (03CR) 10Elukey: "From the Python point of view it LGTM, there is a bit of repetition in the error messages (the code I mean) but it is probably clearer to " [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:42:11] (03PS1) 10JMeybohm: admin_ng: Add dependency from calico to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126519 (https://phabricator.wikimedia.org/T341984) [10:42:23] (03CR) 10Marostegui: [C:03+1] "From my side this looks good too." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:43:20] (03CR) 10Federico Ceratto: [C:03+1] Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:43:22] (03CR) 10Federico Ceratto: [C:03+2] Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:43:37] (03CR) 10Vgutierrez: haproxy: certificate check script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [10:45:02] (03PS2) 10JMeybohm: admin_ng: Add dependency from calico to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126519 (https://phabricator.wikimedia.org/T341984) [10:45:26] (03CR) 10Kamila Součková: [C:03+1] admin_ng: Add dependency from calico to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126519 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:45:39] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126519 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:45:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10622890 (10phaultfinder) [10:46:19] (03PS1) 10David Caro: cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 [10:46:40] (03CR) 10CI reject: [V:04-1] cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [10:46:54] (03CR) 10David Caro: [C:03+2] tools-legacy-redirector: use a custom mpm_event config [puppet] - 10https://gerrit.wikimedia.org/r/1126511 (https://phabricator.wikimedia.org/T385908) 
(owner: 10David Caro) [10:48:02] (03PS9) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:48:03] (03PS8) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [10:48:03] (03PS9) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [10:48:09] (03PS2) 10David Caro: cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 [10:48:22] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/14b37db3c80c16b6d2c413c08e769365d5737573dcf4a419e82ea4ed583833e2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [10:49:13] This seems to be an issue [10:49:33] (03CR) 10CI reject: [V:04-1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [10:49:37] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [10:49:56] effie: It's stuck in "10:39:41 K8s images build/push output redirected to /home/ladsgroup/scap-image-build-and-push-log" [10:50:03] and then this error [10:50:16] probably some clean up is needed in deploy2002? [10:50:56] Amir1: do you have a tmux I can attach to? 
[10:50:57] (03Merged) 10jenkins-bot: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:51:12] (03PS3) 10David Caro: cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 [10:51:20] effie: the screen I have, sudo -u ladsgroup screen -r? [10:51:28] ok [10:51:48] (03CR) 10JMeybohm: [V:03+2 C:03+2] admin_ng: Add dependency from calico to namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126519 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:52:37] (03PS4) 10David Caro: cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 [10:52:53] Amir1: just give it some time mate [10:52:59] it is still with us [10:53:02] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:53:08] the alert seems to be because of this: [10:53:12] (03CR) 10Elukey: [C:03+1] puppetdb: add support for structured facts [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [10:53:15] > /dev/mapper/vg0-srv 277G 213G 50G 82% /srv [10:53:31] (might be related, might not) [10:54:07] while it appears to be, it is not related. The way the check is performed, it is trying to access an fs that is not there anymore [10:54:19] (03Abandoned) 10Ladsgroup: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1119157 (https://phabricator.wikimedia.org/T386213) (owner: 10Gerrit maintenance bot) [10:54:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[10:54:27] (03Abandoned) 10Ladsgroup: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1119158 (https://phabricator.wikimedia.org/T386213) (owner: 10Gerrit maintenance bot) [10:54:54] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [10:55:19] (03CR) 10David Caro: "Tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [10:55:23] (03CR) 10David Caro: [V:03+1] cloud.puppetserver: ensure git is an alias of pgit [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [10:55:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:55:43] (03PS1) 10Ilias Sarantopoulos: (WIP) api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) [10:56:52] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1007.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:56:58] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2007.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [10:58:30] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [10:59:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:59:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:02:58] (03CR) 10FNegri: cloud.puppetserver: ensure git is an alias of pgit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [11:03:50] (03PS1) 10Clément Goubert: P:parsoid::testing: Set wikimedia-servergroup [puppet] - 10https://gerrit.wikimedia.org/r/1126525 (https://phabricator.wikimedia.org/T388465) [11:05:19] !log ladsgroup@deploy2002 ladsgroup, jforrester: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:05:22] !log ladsgroup@deploy2002 ladsgroup, jforrester: Continuing with sync [11:05:24] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069 [11:05:40] (03CR) 10Elukey: [C:03+1] cookbook: make the default argument parser tunable (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1126498 (owner: 10Volans) [11:08:22] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [11:09:23] (03PS1) 10Tiziano Fogli: prometheus/ext: recover original retention_time [puppet] - 10https://gerrit.wikimedia.org/r/1126526 (https://phabricator.wikimedia.org/T388504) [11:09:52] (03CR) 10Kosta Harlan: "This seems to have broken the mediawiki_job_purge_loginnotify and mediawiki_job_purge_temporary_accounts maintenance scripts, which used '" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [11:09:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10622967 (10cmooney) @VRiley-WMF I've fixed up the scs links in netbox now, moving the cables to the new devices. To set the boxes to 'active' (needed b... 
[11:10:01] jouncebot: nowandnext [11:10:01] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [11:10:01] In 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1200) [11:10:08] (03PS1) 10Brouberol: airflow-analytics-test: enable sidecar job controller and the mediawiki PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126527 (https://phabricator.wikimedia.org/T388378) [11:10:24] (03CR) 10Ladsgroup: "I will fix it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [11:10:26] 06SRE, 10Bitu, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10622971 (10Arendpieter) [11:10:48] (03PS1) 10Filippo Giunchedi: prometheus: move k8s-mlstaging to prometheus2008 [puppet] - 10https://gerrit.wikimedia.org/r/1126528 (https://phabricator.wikimedia.org/T383232) [11:11:21] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: enable sidecar job controller and the mediawiki PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126527 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:11:22] (03CR) 10Kosta Harlan: "Thanks! Can you please tag T388125 with the fix?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083493 (owner: 10Majavah) [11:11:35] @Amir1, @effie: I would like to run a maintenance script that cleans up some db-rows of a GrowthExperiments database on frwiki. Can I do that or is there something going on and I should wait? 
[11:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:14:14] MichaelG_WMF: from my side, go ahead [11:14:22] thanks! [11:15:02] 06SRE, 06Infrastructure-Foundations, 10Packaging, 06serviceops: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#10622983 (10hashar) 05Open→03Resolved a:03Legoktm **TLDR: this was solved by @Legoktm in October 2021** I am adding back #serviceops... [11:15:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:15:24] (03PS15) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [11:15:58] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1253 gradually with 4 steps - Pool in for T385141 [11:16:02] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [11:16:10] hm, codfw not looking great [11:16:19] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad [11:16:28] (03CR) 10Fabfur: haproxy: certificate check script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:17:02] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: enable sidecar job controller and the mediawiki PSP 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126527 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:17:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 8.333% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:17:36] (03PS1) 10Ladsgroup: mediawiki: Remove references to nonglobal dblist [puppet] - 10https://gerrit.wikimedia.org/r/1126529 (https://phabricator.wikimedia.org/T388125) [11:18:08] (03CR) 10Kosta Harlan: [C:03+1] mediawiki: Remove references to nonglobal dblist [puppet] - 10https://gerrit.wikimedia.org/r/1126529 (https://phabricator.wikimedia.org/T388125) (owner: 10Ladsgroup) [11:18:12] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus/ext: recover original retention_time [puppet] - 10https://gerrit.wikimedia.org/r/1126526 (https://phabricator.wikimedia.org/T388504) (owner: 10Tiziano Fogli) [11:18:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad [11:18:51] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/ext: recover original retention_time [puppet] - 10https://gerrit.wikimedia.org/r/1126526 (https://phabricator.wikimedia.org/T388504) (owner: 10Tiziano Fogli) [11:19:04] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:19:32] !log migr@mwmaint2002: ran "time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=frwiki --db-table --verbose --force 2>&1 | tee ~/frwiki-dbtable.txt" [11:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:52] (03PS4) 10Slyngshede: Permissions LDAP group validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 [11:20:08] (03PS2) 10Ladsgroup: mediawiki: Remove references to nonglobal dblist [puppet] - 10https://gerrit.wikimedia.org/r/1126529 (https://phabricator.wikimedia.org/T388125) [11:20:11] (03CR) 10Ladsgroup: [V:03+2 
C:03+2] mediawiki: Remove references to nonglobal dblist [puppet] - 10https://gerrit.wikimedia.org/r/1126529 (https://phabricator.wikimedia.org/T388125) (owner: 10Ladsgroup) [11:20:20] (03CR) 10Slyngshede: Permissions LDAP group validator (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [11:20:45] (03CR) 10Fabfur: haproxy: certificate check script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:20:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:21:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad [11:21:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10623041 (10cmooney) [11:22:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 2.778% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:23:48] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add new network switches - cmooney@cumin1002 - T382017" [11:23:52] T382017: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017 [11:24:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add new network switches - cmooney@cumin1002 - T382017" [11:24:13] (03CR) 10Vgutierrez: [C:04-1] haproxy: certificate check script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:24:31] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: 
name=maps2008.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [11:24:38] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1008.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [11:25:25] (03PS1) 10Slyngshede: Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 [11:26:40] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10623053 (10PantheraLeo1359531) Yes, thank you for that hint; I try to collect the information. Here is another... [11:26:57] (03PS5) 10Vgutierrez: sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) [11:27:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:28:15] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster staging-codfw: Kubernetes upgrade [11:29:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:29:23] (03PS3) 10Federico Ceratto: Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) [11:29:23] (03CR) 10Federico Ceratto: "e2e tested successfully with test-cookbook -c 1126040 sre.mysql.pool db1253 -r "Pool in for T385141"" [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 
(https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [11:30:00] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] [11:30:03] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069 [11:30:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10623075 (10cmooney) @robh did Myriad send us the license files for these two new QFX5120-48Y switches? [11:31:17] !log enable connections from ssw1-e1 and ssw1-f1 to new top-of-rack switches lsw1-e8 and lsw1-f8 in eqiad T382017 [11:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:20] T382017: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017 [11:32:54] (03PS16) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [11:33:30] (03PS4) 10Federico Ceratto: Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) [11:34:22] (03CR) 10Fabfur: haproxy: certificate check script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:34:49] !log ladsgroup@deploy2002 ladsgroup, jforrester: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:34:52] !log ladsgroup@deploy2002 ladsgroup, jforrester: Continuing with sync [11:36:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:37:53] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 
13Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480#10623104 (10fgiunchedi) 05Resolved→03Open Reopening as we are not done here... [11:38:41] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 13Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480#10623109 (10SLyngshede-WMF) a:05SLyngshede-WMF→03None [11:39:37] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:41:05] !log dropping transcache table everywhere (T376627) [11:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:12] T376627: Drop ad-hoc or obsolete tables in production - https://phabricator.wikimedia.org/T376627 [11:42:27] Amir1: that table is going away?? 
nice [11:43:12] That table was dropped 13 releases ago https://www.mediawiki.org/wiki/Manual:Transcache_table [11:43:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:43:37] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] (duration: 13m 36s) [11:43:38] I can't find any code relying on it either https://codesearch.wmcloud.org/deployed/?q=transcache&files=&excludeFiles=&repos= [11:43:40] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069 [11:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10623132 (10phaultfinder) [11:45:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126515 (https://phabricator.wikimedia.org/T388323) (owner: 10Ladsgroup) [11:47:54] generally speaking, I hope the number of files opened by s3 mariadb is normal/healthy [11:48:33] (03PS17) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [11:48:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Preparing db1254 for T385141', diff saved to https://phabricator.wikimedia.org/P74183 and previous config saved to /var/cache/conftool/dbconfig/20250311-114835-fceratto.json [11:48:39] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [11:49:57] (03Merged) 10jenkins-bot: FileModule: Normalize file paths for deps tracked from CSSMin [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126515 (https://phabricator.wikimedia.org/T388323) (owner: 10Ladsgroup) [11:50:13] !log fceratto@cumin1002 START 
- Cookbook sre.mysql.pool db1254 gradually with 4 steps - Pool in for T385141 [11:50:26] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126515|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] [11:50:30] T388323: ResourceLoaderModule-dependencies writes the exact same value to database multiple times every second - https://phabricator.wikimedia.org/T388323 [11:50:37] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 13Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480#10623140 (10fgiunchedi) [11:50:52] (03PS1) 10Cathal Mooney: Add base monitoring elements for new top-of-rack switches eqiad E8/F8 [puppet] - 10https://gerrit.wikimedia.org/r/1126534 (https://phabricator.wikimedia.org/T382017) [11:51:04] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [11:52:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10623146 (10MatthewVernon) OK, so the disk pulled was `sdl`: ` Mar 10 16:45:30 ms-be2088 kernel: [267287.723999] megaraid_sas 0000:98:00... 
[11:53:09] (03PS2) 10Cathal Mooney: Add base monitoring elements for new top-of-rack switches eqiad E8/F8 [puppet] - 10https://gerrit.wikimedia.org/r/1126534 (https://phabricator.wikimedia.org/T382017) [11:55:26] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1126515|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:55:30] T388323: ResourceLoaderModule-dependencies writes the exact same value to database multiple times every second - https://phabricator.wikimedia.org/T388323 [11:55:41] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1200) [12:01:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:01:59] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126525 (https://phabricator.wikimedia.org/T388465) (owner: 10Clément Goubert) [12:02:24] (03CR) 10Slyngshede: [C:03+2] Permissions LDAP group validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [12:02:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:41] (03PS1) 10Hashar: Pin types-setuptools to support pkg_resources [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 [12:04:03] (03CR) 10Hashar: "Found via https://github.com/typeshed-internal/stub_uploader/blob/main/data/changelogs/setuptools.md" [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [12:04:08] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126515|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] (duration: 13m 41s) [12:04:12] T388323: ResourceLoaderModule-dependencies writes the exact same value to database multiple times every second - https://phabricator.wikimedia.org/T388323 [12:05:48] (03PS2) 10Michael Große: Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) [12:05:48] (03CR) 10Michael Große: "to be deployed on Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [12:06:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126516 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:07:07] vgutierrez: fabfur Emperor: Thumbnail steps going to 2% [12:07:08] (03CR) 10CI reject: [V:04-1] Pin types-setuptools to support pkg_resources [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [12:07:30] 👀 [12:07:30] (03Merged) 10jenkins-bot: Bump thumbnail steps to 2% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126516 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:08:00] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126516|Bump thumbnail steps to 2% (T360589)]] [12:08:04] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:09:51] !log jiji@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main1001.eqiad.wmnet with reason: decom [12:10:38] (03Merged) 10jenkins-bot: Permissions LDAP group validator [software/bitu] - 
10https://gerrit.wikimedia.org/r/1115375 (owner: 10Slyngshede) [12:10:56] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1126516|Bump thumbnail steps to 2% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:11:09] !log jiji@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main1002.eqiad.wmnet with reason: decom [12:11:16] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main[1001-1005].eqiad.wmnet [12:11:24] !log jiji@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main1003.eqiad.wmnet with reason: decom [12:11:40] !log jiji@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main1004.eqiad.wmnet with reason: decom [12:11:51] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:11:55] !log jiji@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main1005.eqiad.wmnet with reason: decom [12:14:10] (03PS1) 10Elukey: services: bump Kartotherian's instances to 50 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126538 (https://phabricator.wikimedia.org/T386926) [12:16:19] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2009.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [12:16:25] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2010.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [12:18:18] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126516|Bump thumbnail steps to 2% (T360589)]] (duration: 10m 18s) [12:18:22] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:19:44] (03CR) 10Elukey: "I am totally aware that most of you will say "Luca again? Really? 
Does he want the entire Wikikube cluster to run a service?"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126538 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [12:20:20] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10623201 (10PantheraLeo1359531) 12:19 PM CET 24 MiB {F58783006} [12:23:27] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [12:23:34] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:23:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede) [12:23:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10623205 (10BTullis) [12:30:10] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [12:31:50] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [12:32:03] (03CR) 10Clément Goubert: [C:03+1] "lgtm, capacity is there." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126538 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [12:32:20] (03CR) 10Elukey: [C:03+2] services: bump Kartotherian's instances to 50 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126538 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [12:32:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10623239 (10BTullis) [12:33:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10623240 (10BTullis) [12:33:35] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:33:57] (03CR) 10Federico Ceratto: Implement Icinga notification check before pooling in a host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [12:34:21] (03PS1) 10Hnowlan: helmfile: use integer instead of string in roll_restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126541 [12:34:31] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:34:46] (03PS2) 10Hashar: Pin types-setuptools to support pkg_resources [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 [12:34:49] (03PS1) 10Slyngshede: C:raid:perccli do not error out if controller is no in use [puppet] - 10https://gerrit.wikimedia.org/r/1126542 [12:34:50] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [12:35:06] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[12:35:57] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1254 gradually with 4 steps - Pool in for T385141 [12:36:00] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [12:36:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10623250 (10BTullis) I believe that I may have made an off-by-one error in calculating the number of drives/hosts. The list above is... [12:37:33] (03PS1) 10Ayounsi: Add sandbox firewall filter to sandbox vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1126543 (https://phabricator.wikimedia.org/T388419) [12:37:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [12:37:59] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:38:23] (03CR) 10David Caro: [V:03+1 C:03+2] "I'm ok with any solution, I'll merge this, if you prefer any of the others, feel free to send a patch and I'll approve :)" [puppet] - 10https://gerrit.wikimedia.org/r/1126520 (owner: 10David Caro) [12:38:25] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1037.eqiad.wmnet with reason: remove from cluster for reimage [12:38:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10623254 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43cdf866-0dde-4aee-ad05-0604c388b7b3) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[12:38:51] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [12:38:56] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [12:39:30] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [12:39:32] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [12:40:22] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1009.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [12:40:28] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1010.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [12:41:53] (03PS18) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [12:41:55] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main[1001-1005].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [12:42:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main[1001-1005].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [12:42:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:42:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main[1001-1005].eqiad.wmnet [12:42:27] (03PS1) 10Ilias Sarantopoulos: ml-services: separate deployment for reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126545 (https://phabricator.wikimedia.org/T387019) [12:43:38] (03PS1) 10Gergő Tisza: Enable SUL3 signup for 10% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126546 
(https://phabricator.wikimedia.org/T384218) [12:44:12] (03CR) 10Clément Goubert: [C:03+1] helmfile: use integer instead of string in roll_restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126541 (owner: 10Hnowlan) [12:45:14] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126534 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [12:47:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126546 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [12:50:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:50:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T371742)', diff saved to https://phabricator.wikimedia.org/P74189 and previous config saved to /var/cache/conftool/dbconfig/20250311-125007-ladsgroup.json [12:50:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:54:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [12:54:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10623309 (10VRiley-WMF) The motherboard will be replaced today. [12:54:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T371742)', diff saved to https://phabricator.wikimedia.org/P74190 and previous config saved to /var/cache/conftool/dbconfig/20250311-125458-ladsgroup.json [12:55:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10623316 (10Marostegui) >>! 
In T387673#10623309, @VRiley-WMF wrote: > The motherboard will be replaced today. Great news, the host is fully ready for you to proceed. Thank you! [12:56:02] !log Stop MariaDB on db1246 T387673 [12:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:05] T387673: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673 [12:56:41] (03CR) 10Volans: "At first pass looks ok, do you have a host where to test it and a normal host where to test it still works?" [puppet] - 10https://gerrit.wikimedia.org/r/1126542 (owner: 10Slyngshede) [12:57:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [12:57:33] !log Poweroff db1246 T387673 [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10623337 (10Marostegui) @VRiley-WMF I switched the host off. [12:59:05] (03CR) 10Cathal Mooney: [C:03+1] Add sandbox firewall filter to sandbox vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1126543 (https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [12:59:58] (03CR) 10Cathal Mooney: [C:03+2] Add base monitoring elements for new top-of-rack switches eqiad E8/F8 [puppet] - 10https://gerrit.wikimedia.org/r/1126534 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [13:01:00] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1037 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1126494 (owner: 10Muehlenhoff) [13:01:31] topranks: I'll puppet-merge your "add base monitoring" patch along-side, ok? 
[13:02:51] I went ahead and merged [13:04:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T371742)', diff saved to https://phabricator.wikimedia.org/P74191 and previous config saved to /var/cache/conftool/dbconfig/20250311-130458-ladsgroup.json [13:05:02] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:05:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [13:05:03] (03CR) 10Hashar: "This can be potentially improved by uncapping the dependency when using python 3.10 or later:" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [13:06:48] (03PS1) 10Ladsgroup: Switch the footer link to wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) [13:07:27] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:08:04] moritzm: sorry missed that yes thank you [13:08:13] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:08:39] (03CR) 10Volans: [C:03+2] cookbook: make the default argument parser tunable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1126498 (owner: 10Volans) [13:09:04] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:09:09] (03CR) 10Klausman: [C:03+1] ml-services: separate deployment for reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126545 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:09:28] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:10:24] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:10:48] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:12:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10623413 (10Vgutierrez) >>! In T385067#10620624, @jhathaway wrote: > I think this should be fine, looking through the logs it appears that almost al... [13:12:43] (03PS1) 10Klausman: home/klausman: Fix tmuxp recipes [puppet] - 10https://gerrit.wikimedia.org/r/1126549 [13:12:47] (03PS19) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [13:13:23] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: separate deployment for reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126545 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:13:37] (03PS1) 10Jforrester: Fix missing parens in TableOfContents.less [skins/Vector] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126550 (https://phabricator.wikimedia.org/T388475) [13:15:00] (03CR) 10CI reject: [V:04-1] haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [13:15:18] (03Merged) 10jenkins-bot: ml-services: separate deployment for reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126545 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:16:17] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:19:16] (03CR) 10Marostegui: Implement Icinga notification check before pooling in a host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [13:20:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P74192 and previous config saved to /var/cache/conftool/dbconfig/20250311-132005-ladsgroup.json [13:21:28] (03Merged) 10jenkins-bot: cookbook: make the default argument parser tunable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1126498 (owner: 10Volans) [13:21:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10623472 (10MoritzMuehlenhoff) [13:22:37] (03PS20) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [13:25:17] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [13:25:57] (03CR) 10Tacsipacsi: "Thanks for backporting!" [skins/Vector] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126550 (https://phabricator.wikimedia.org/T388475) (owner: 10Jforrester) [13:27:49] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10623486 (10jhathaway) > Please note that ECDHE there refers to Eliptic curve diffie hellman, the mechanism used to negoatiate a symmetric encryptio... 
[13:30:18] (03PS1) 10Sbisson: Disable CX unified dashboard on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126555 (https://phabricator.wikimedia.org/T387820) [13:31:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126555 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [13:31:51] (03PS1) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) [13:35:02] (03CR) 10Muehlenhoff: "They are also still listed in preseed.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [13:35:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P74193 and previous config saved to /var/cache/conftool/dbconfig/20250311-133512-ladsgroup.json [13:36:55] (03PS5) 10Federico Ceratto: Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) [13:38:09] (03CR) 10Federico Ceratto: Implement Icinga notification check before pooling in a host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [13:40:12] jouncebot: nowandnext [13:40:12] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [13:40:12] In 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1400) [13:40:20] does anyone mind if I already start with my deployments? 
[13:40:27] (03PS2) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) [13:41:13] (03CR) 10Filippo Giunchedi: [C:03+1] Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [13:42:06] I’ll go ahead then [13:42:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [13:42:35] (03CR) 10CI reject: [V:04-1] Remove Wikibase fixed RDF feature flag again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [13:42:45] (03PS4) 10Lucas Werkmeister (WMDE): Remove Wikibase fixed RDF feature flag again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) [13:42:55] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [13:43:49] (03Merged) 10jenkins-bot: Remove Wikibase fixed RDF feature flag again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [13:44:18] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118487|Remove Wikibase fixed RDF feature flag again (T384344)]] [13:44:21] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [13:46:28] (03CR) 10Herron: [C:03+2] "Hey Moritz, in this case 
role::aux_k8s::worker_insetup is the same as role::aux_k8s::worker with the exception of the include profile::lvs" [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:47:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1118487|Remove Wikibase fixed RDF feature flag again (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:47:25] lgtm, syncing [13:47:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [13:47:27] (03CR) 10Filippo Giunchedi: "The files in the paste above are the ones I found without an explicit #! interpreter. They are either tests, or already run under python3" [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [13:47:51] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:48:17] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[13:48:33] (03CR) 10Herron: [C:03+1] alert: Remove stale vops-bot-sync-db* service [puppet] - 10https://gerrit.wikimedia.org/r/1126128 (https://phabricator.wikimedia.org/T388444) (owner: 10Andrea Denisse) [13:48:43] (03PS1) 10Gergő Tisza: Prepare Less styles for math=parens-division [extensions/MultimediaViewer] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126562 (https://phabricator.wikimedia.org/T382931) [13:48:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126562 (https://phabricator.wikimedia.org/T382931) (owner: 10Gergő Tisza) [13:50:09] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 [13:50:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T371742)', diff saved to https://phabricator.wikimedia.org/P74194 and previous config saved to /var/cache/conftool/dbconfig/20250311-135019-ladsgroup.json [13:50:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:50:53] (03PS3) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) [13:52:45] (03CR) 10Muehlenhoff: site.pp: decom kafka-main100[1-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [13:53:47] (03PS4) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) [13:53:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118487|Remove Wikibase fixed RDF feature flag again 
(T384344)]] (duration: 09m 31s) [13:53:53] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [13:54:24] alright, one less patch for the window in a few minutes [13:54:32] (less? fewer? patch? patches? idk) [13:54:55] (03CR) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [13:56:24] (03CR) 10Muehlenhoff: site.pp: decom kafka-main100[1-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [13:57:42] (03PS2) 10Ilias Sarantopoulos: api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) [13:57:47] hi Lucas_WMDE o/ [13:58:13] hi! [13:58:38] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10623770 (10Jelto) > So that sounds promising that Bacula should be able to backup straight from an S3 storage, it seems to me. Won... [13:59:54] (03PS5) 10Effie Mouzeli: site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) [14:00:22] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1400) [14:00:23] Lucas_WMDE, Dreamy_Jazz, xSavitar, abijeet, tgr, and stephanebisson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:31] o/ [14:00:32] o/ [14:00:33] o/ [14:00:34] My change has already been deployed [14:00:38] Forgot to remove it [14:00:42] I’m around but in a meeting, so I wouldn’t mind if someone else deploys :) [14:00:46] I also already deployed my change btw [14:00:49] o/ [14:01:09] o/ [14:02:09] ok, I can deploy anyway [14:02:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [14:02:17] let’s start with xSavitar and also +2 abijeet’s backport already [14:02:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01) [14:02:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Translate] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126489 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [14:03:23] (03Merged) 10jenkins-bot: Set `$wgCentralAuthLoginWiki` to correct default as documented [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01) [14:03:53] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1125491|Set `$wgCentralAuthLoginWiki` to correct default as documented (T388218)]] [14:03:56] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.19/extensions/CentralAuth/includes/CentralDoma - https://phabricator.wikimedia.org/T388218 [14:04:01] Lucas_WMDE, thanks! 
[14:04:34] (03Merged) 10jenkins-bot: EventLogging: Improve handling when suggestions are not present [extensions/Translate] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126489 (https://phabricator.wikimedia.org/T388467) (owner: 10Abijeet Patro) [14:04:58] huh, that merge was fast ^^ [14:05:45] Lucas_WMDE: let's hold off on the MediaViewer patch for a bit. Trying to clarify if that's the right fix. [14:05:58] ok [14:06:43] (03PS9) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [14:06:43] (03PS10) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:06:48] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1125491|Set `$wgCentralAuthLoginWiki` to correct default as documented (T388218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:51] Lucas_WMDE, thanks! Ready to test anytime. You let me know. [14:06:58] xSavitar: great timing, you can test now :) [14:07:07] OKay, on it ... [14:08:18] abijeet: sorry, some delayed but I think Lucas_WMDE is taking care :) [14:08:24] kart_, yup. [14:08:28] cool. [14:08:56] Lucas_WMDE: any changes in CI. I also noticed that Translate merges are super fast now. [14:08:59] Lucas_WMDE, all green. You can proceed :) [14:09:04] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync [14:09:07] xSavitar: thanks! 
[14:10:12] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: move k8s-mlstaging to prometheus2008 [puppet] - 10https://gerrit.wikimedia.org/r/1126528 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:10:17] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: decom kafka-main100[1-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1126556 (https://phabricator.wikimedia.org/T381593) (owner: 10Effie Mouzeli) [14:12:31] (03CR) 10Scott French: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126525 (https://phabricator.wikimedia.org/T388465) (owner: 10Clément Goubert) [14:14:15] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10623829 (10jijiki) [14:15:10] (03PS1) 10Gergő Tisza: Revert "ResourceLoader: Enable Less.php math=parens-division" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) [14:15:25] (03PS1) 10Herron: wip [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 [14:15:28] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125491|Set `$wgCentralAuthLoginWiki` to correct default as documented (T388218)]] (duration: 11m 35s) [14:15:32] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.19/extensions/CentralAuth/includes/CentralDoma - https://phabricator.wikimedia.org/T388218 [14:15:34] (03PS11) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:15:39] !log filippo@cumin1002 conftool action : set/pooled=no; selector: name=prometheus2006.codfw.wmnet [14:15:47] !log filippo@cumin1002 conftool action : set/pooled=no; selector: 
name=prometheus2008.codfw.wmnet [14:15:54] (03CR) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [14:16:11] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1126489|EventLogging: Improve handling when suggestions are not present (T388467)]] [14:16:14] T388467: TypeError: mtSuggestions.map is not a function - https://phabricator.wikimedia.org/T388467 [14:16:26] (03PS2) 10Gergő Tisza: Revert "ResourceLoader: Enable Less.php math=parens-division" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) [14:16:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) (owner: 10Gergő Tisza) [14:16:36] !log filippo@cumin1002 conftool action : set/weight=10; selector: name=prometheus2008.codfw.wmnet [14:16:49] (03CR) 10Jforrester: [C:03+1] "Thanks!" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126562 (https://phabricator.wikimedia.org/T382931) (owner: 10Gergő Tisza) [14:16:55] (03PS3) 10Volans: setup.py: pin types-setuptools for pkg_resources [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [14:17:08] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10623838 (10herron) [14:17:19] (03CR) 10Volans: [C:03+2] "Thanks Antoine for the temporary fix. Hopefully we can remove all this soon." 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [14:17:33] (03CR) 10Nikerabbit: [C:03+1] MinT: Increase rediness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [14:17:39] (03Abandoned) 10Gergő Tisza: Prepare Less styles for math=parens-division [extensions/MultimediaViewer] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126562 (https://phabricator.wikimedia.org/T382931) (owner: 10Gergő Tisza) [14:18:27] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [14:18:30] hmm, new error at the top of logspam-watch [14:18:31] e/C/i/CentralDomainUtils:140 MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl: Invalid wiki ID: [14:18:47] cc xSavitar, sounds like it could be related to your change? [14:18:48] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move k8s-mlstaging to prometheus2008 [puppet] - 10https://gerrit.wikimedia.org/r/1126528 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:18:54] * Lucas_WMDE peeks at logstash [14:19:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1126489|EventLogging: Improve handling when suggestions are not present (T388467)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:19:13] abijeet: please test [14:19:20] :) [14:19:49] Lucas_WMDE: replaced the parens-division patch with a different one [14:20:06] xSavitar: the error volume in logstash lines up pretty well with your config change so I think that’s the cause :( [14:20:11] will probably need a bit to test so let me know if you want to switch [14:20:31] (03PS12) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:20:41] tgr_: right now I’m thinking 
I’ll probably want to revert xSavitar’s config change first (after the current backport is done) [14:20:45] happy to hand over to you after that [14:20:46] the config change is fine to revert [14:20:54] (also stephanebisson is still in the queue) [14:22:18] (03Merged) 10jenkins-bot: setup.py: pin types-setuptools for pkg_resources [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1126536 (owner: 10Hashar) [14:22:38] (03PS2) 10Hnowlan: helmfile: use integer instead of string in roll_restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126541 [14:23:06] (03PS2) 10Volans: interactive: notify when waiting for input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 [14:23:24] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:23:26] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [14:24:00] abijeet (or kart_): can you test the EventLogging backport? [14:24:07] !log filippo@cumin1002 conftool action : set/pooled=yes; selector: name=prometheus2008.codfw.wmnet [14:24:36] uh oh, stashbot quit [14:25:00] (03CR) 10Gergő Tisza: [C:03+1] "Turns out the documentation was wrong and the live config right - we use `??` everywhere in the code for fallbacks. Let's just fix in exte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01) [14:25:07] Lucas_WMDE, we can revert, sorry stepped out for a bit. [14:25:12] ok thanks [14:25:18] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10623883 (10herron) Quick status update here, the aux-k8s codfw cluster is running and the aux-k8s-ctrl.svc.codfw.wmnet vip is live. ` deploy1003... 
[14:25:21] still waiting for the current deployment to finish first [14:25:25] unless you think it’s super urgent to revert [14:25:33] I haven’t checked what the impact of those logstash messages is [14:25:40] (03PS1) 10Jgiannelos: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) [14:25:48] Lucas_WMDE checking logstash now [14:26:36] Lucas_WMDE, on it [14:26:36] (03PS1) 10Muehlenhoff: Remove obsolete custom partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) [14:26:39] Lucas_WMDE: it's breaking autologins, not the end of the world. [14:26:47] abijeet: thanks [14:26:48] tgr_: ack [14:27:06] edge logins only, actually [14:27:59] (03CR) 10D3r1ck01: "Ack!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01) [14:28:07] (03PS2) 10Volans: tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 [14:28:20] (03PS1) 10D3r1ck01: Revert "Set `$wgCentralAuthLoginWiki` to correct default as documented" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126571 [14:28:30] tgr_, I think let's revert instead and I'll fix the docs. [14:28:50] Lucas_WMDE, tgr_ shall we: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1126571? [14:29:32] yes, once the current deployment is done [14:30:19] Thank you so much! 
[14:30:51] (03PS3) 10Gergő Tisza: Revert "ResourceLoader: Enable Less.php math=parens-division" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) [14:30:52] (03CR) 10Clément Goubert: [C:03+2] P:parsoid::testing: Set wikimedia-servergroup [puppet] - 10https://gerrit.wikimedia.org/r/1126525 (https://phabricator.wikimedia.org/T388465) (owner: 10Clément Goubert) [14:31:02] !log filippo@cumin1002 conftool action : set/pooled=yes; selector: name=prometheus2006.codfw.wmnet [14:31:48] Lucas_WMDE, looks good [14:31:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, abi: Continuing with sync [14:31:54] great, thanks! [14:32:35] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10623922 (10MoritzMuehlenhoff) [14:37:14] nooooooo [14:37:19] I accidentally Ctrl+C’ed my scap [14:37:25] in the middle of k8s deployment progress [14:37:28] what happens now… [14:37:53] !log accidentally Ctrl+C’ed ongoing scap, was last seen at 80% sync-prod-k8s progress [14:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:03] I… guess I just go ahead with the revert for xSavitar right now? 
[14:38:11] and then that’ll roll out everywhere and include the EventLogging fix too [14:38:13] sure [14:38:16] I think it will just go on [14:38:30] lock is still held at least [14:38:35] (/var/lock/scap.srv_mediawiki-staging.lock) [14:38:38] guess I’ll watch that file then [14:39:08] it has its own detached process IIRC, as long as it doesn't require user input it's pretty robust [14:39:20] ok [14:39:43] * Lucas_WMDE looks at https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 [14:43:05] (03PS1) 10Brouberol: mediawiki: render configmaps when dumps are enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) [14:43:33] scap lock is still held… I feel like it should be done soonish [14:44:02] (last logstash message is at 14:36:53) [14:45:16] If you accidentally kill scap, it's always safe and recommended to re-run it to completion [14:45:31] with the original change, or will another change do? [14:45:41] Either is fine. 
[14:45:43] ok [14:45:46] let’s see if it works then [14:45:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126571 (owner: 10D3r1ck01) [14:46:31] (03CR) 10Marostegui: [C:03+1] Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [14:46:43] (03PS1) 10Zoe: Remove Flow as the default talk system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) [14:46:50] (03Merged) 10jenkins-bot: Revert "Set `$wgCentralAuthLoginWiki` to correct default as documented" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126571 (owner: 10D3r1ck01) [14:46:52] 06SRE, 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10623992 (10Aklapper) Ah, thanks (and sorry)! [14:47:09] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:47:18] ok it seems to be continuing with the deployment [14:47:19] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1126571|Revert "Set `$wgCentralAuthLoginWiki` to correct default as documented"]] [14:47:24] so I guess it usurped the previous lock or something [14:47:50] (in the `watch` tab I briefly saw it switch to “(no justification provided)” and now it’s showing the $wgCentralAuthLoginWiki revert as the reason) [14:47:56] thanks dancy! [14:48:36] xSavitar, tgr_: do you think it’s worth testing the revert on mwdebug? or should I just deploy it to get it rolled out faster? 
[14:48:43] (it’s not there yet, just asking in advance) [14:48:51] Testing should be quick [14:48:54] ok [14:49:05] just logging in and verifying that edge-login also works. [14:49:11] Should be done in under 1 min [14:49:17] !log moving k8s-mlstaging off prometheus200[56] completed - T383232 [14:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:20] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [14:50:14] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Backport for [[gerrit:1126571|Revert "Set `$wgCentralAuthLoginWiki` to correct default as documented"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:19] xSavitar: please test :) [14:50:28] on it... [14:52:14] regular login works, edge-login works, errors are reducing (on logstash) [14:52:17] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Continuing with sync [14:52:18] Lucas_WMDE ^ [14:52:21] ok, thanks! [14:53:24] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:53:36] (03PS2) 10Brouberol: mediawiki: render configmaps when dumps are enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) [14:55:07] Lucas_WMDE: time for stephanebisson's change in window? [14:55:22] not sure [14:55:32] I doubt it, actually [14:55:40] unless the SREs don’t need the upcoming window [14:55:49] next window is an office hour, I doubt it blocks deploys [14:56:07] Right. I just saw that. 
[14:56:14] I’ll still be around [14:56:32] in any case the core change is a train unblocker so I'll definitely need to deploy that [14:56:45] yes! [14:57:52] (03CR) 10Hnowlan: [C:03+1] api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:58:00] then let’s do that one next [14:58:10] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ always a good day when removing partman recipes" [puppet] - 10https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:58:48] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126571|Revert "Set `$wgCentralAuthLoginWiki` to correct default as documented"]] (duration: 11m 28s) [14:59:01] tgr_: want to deploy yourself or should I do it? [14:59:19] (also, want to include the config change with the backport or deploy that separately?) [15:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1500). [15:00:27] we’d like to keep the backport window open for a few more patches if that’s okay [15:00:59] Lucas_WMDE: happy to leave it to you if you are up to it but also feel free to leave [15:01:28] I can do it, sure [15:01:30] probably fine to do in one? 
[15:01:44] ok [15:01:48] the config change seems very straightforward [15:01:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) (owner: 10Gergő Tisza) [15:01:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126546 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [15:02:10] (03CR) 10Hnowlan: [C:03+2] helmfile: use integer instead of string in roll_restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126541 (owner: 10Hnowlan) [15:02:11] Lucas_WMDE, thanks, no more errors related to what we saw earlier on logstash. All gone. [15:02:17] \o/ [15:02:39] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10624139 (10VRiley-WMF) Dell has had a delay in the part being delivered. They estimate that this should be in tomorrow. Postponed until tomorrow. [15:02:41] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! 
Applies cleanly on latest git master thus +2" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1125595 (owner: 10Pppery) [15:02:47] doh, I thought you were asking about the ContentTranslation config change [15:02:47] (03Merged) 10jenkins-bot: Enable SUL3 signup for 10% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126546 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [15:02:48] (03CR) 10Btullis: [C:03+1] mediawiki: render configmaps when dumps are enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:03:01] oh, sorry [15:03:03] anyway, should still be fine, it's just a volume bump [15:03:14] yeah, I thought “very straightforward” could describe that one too ^^ [15:03:16] (03CR) 10Klausman: [C:03+1] api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:03:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10624149 (10MoritzMuehlenhoff) [15:03:27] my bad, in hindsight it was obvious in context what you meant [15:04:19] (03CR) 10Scott French: "Thanks, Moritz! Antoine, I'll sync with you later on to confirm that this has made it to beta." [puppet] - 10https://gerrit.wikimedia.org/r/1125529 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [15:04:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10624158 (10Marostegui) Thanks for the update. The host will remain off, so proceed whenever you want. 
[15:04:27] (03CR) 10Scott French: [C:03+2] P:mediawiki::php: install PCRE2 backport from component/php81 [puppet] - 10https://gerrit.wikimedia.org/r/1125529 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [15:04:51] (03PS3) 10Brouberol: mediawiki: render configmaps when dumps are enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) [15:05:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10624167 (10VRiley-WMF) a:03VRiley-WMF [15:05:34] I mean… CI is still running [15:05:41] I guess I could just Ctrl+C scap and re-run it with the config change too [15:05:48] (03CR) 10Clément Goubert: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:05:56] yeah lemme just do that [15:06:01] (03CR) 10Ilias Sarantopoulos: [C:03+2] api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:06:03] (03CR) 10Ebernhardson: "If i understand this proposal, the problem is going to be that without the UI requests routing through the wcqs servers, there is no path " [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) (owner: 10Jelto) [15:06:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) (owner: 10Gergő Tisza) [15:06:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126555 
(https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [15:06:22] stephanebisson: ^ FYI I’m about to deploy your idwiki CX config change [15:06:34] (03Merged) 10jenkins-bot: Revert "ResourceLoader: Enable Less.php math=parens-division" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126567 (https://phabricator.wikimedia.org/T388475) (owner: 10Gergő Tisza) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:43] I'm ready to test [15:06:58] (03Merged) 10jenkins-bot: Disable CX unified dashboard on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126555 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [15:07:30] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1126567|Revert "ResourceLoader: Enable Less.php math=parens-division" (T388475 T388526)]], [[gerrit:1126546|Enable SUL3 signup for 10% of group 2 users (T384218)]], [[gerrit:1126555|Disable CX unified dashboard on idwiki (T387820)]] [15:07:38] T388475: [regression, subtask] Table of contents dropdown icons overlapping with text - https://phabricator.wikimedia.org/T388475 [15:07:38] T388526: MediaViewer broken in wmf.20 due to LESS change: Less_Exception_Compiler: error evaluating function `floor` math functions take numbers as parameters index: 1191 in mmv.ui.metadataPanel.less - https://phabricator.wikimedia.org/T388526 [15:07:38] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [15:07:39] T387820: Deploy unified dashboard on 10 more wikis (phase 2) - https://phabricator.wikimedia.org/T387820 [15:08:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:08:49] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:08:51] How on earth are the merges so fast today? [15:09:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10624204 (10Jhancock.wm) @elukey do you need this server for any other testing? wanted to pass it along if not. [15:09:39] stephanebisson: for backports, I believe it’s a combination of PHPUnit tests running in parallel and browser tests having been disabled on wmf branches [15:09:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10624207 (10Jhancock.wm) working on this, it's just being a little tantrum-y [15:10:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:10:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:10:31] !log lucaswerkmeister-wmde@deploy2002 sbisson, tgr, lucaswerkmeister-wmde: Backport for [[gerrit:1126567|Revert "ResourceLoader: Enable Less.php math=parens-division" (T388475 T388526)]], [[gerrit:1126546|Enable SUL3 signup for 10% of group 2 users (T384218)]], [[gerrit:1126555|Disable CX unified dashboard on idwiki (T387820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:10:37] (03Merged) 10jenkins-bot: helmfile: use integer instead of string in roll_restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126541 (owner: 10Hnowlan) [15:10:42] tgr_, stephanebisson: please test :) [15:11:07] (I guess the SUL3 volume increase will be tricky to test ^^) [15:11:24] Lucas_WMDE working as expected [15:11:29] nice [15:11:35] we only have wmf.20 on the testwikis, right? 
[15:11:42] I believe so yeah [15:13:14] (03PS2) 10Jgiannelos: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) [15:13:24] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:13:25] ugh, we have a test-commons wiki but it doesn't have any files? [15:13:58] yeah :( [15:14:02] at least it has real Commons as a foreign repo [15:14:06] I think it would be super useful to have but not everyone agrees [15:14:40] (it used to have files and stuff and then it was cleared out again) [15:15:43] (03Merged) 10jenkins-bot: api-gateway: change hosts for reference-risk/need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126523 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:16:44] (03PS3) 10Scott French: Profile::Mediawiki_deployment: add 'deploy' field to release config [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) [15:16:44] (03PS3) 10Scott French: hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) [15:16:45] (03PS3) 10Scott French: deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) [15:18:22] Lucas_WMDE: seems to work correctly [15:18:28] yay, thanks [15:18:29] !log lucaswerkmeister-wmde@deploy2002 sbisson, tgr, lucaswerkmeister-wmde: Continuing with sync [15:20:53] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:21:16] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:24:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [15:24:52] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126567|Revert "ResourceLoader: Enable Less.php math=parens-division" (T388475 T388526)]], [[gerrit:1126546|Enable SUL3 signup for 10% of group 2 users (T384218)]], [[gerrit:1126555|Disable CX unified dashboard on idwiki (T387820)]] (duration: 17m 22s) [15:24:59] T388475: [regression, subtask] Table of contents dropdown icons overlapping with text - https://phabricator.wikimedia.org/T388475 [15:24:59] T388526: MediaViewer broken in wmf.20 due to LESS change: Less_Exception_Compiler: error evaluating function `floor` math functions take numbers as parameters index: 1191 in mmv.ui.metadataPanel.less - https://phabricator.wikimedia.org/T388526 [15:25:00] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [15:25:00] T387820: Deploy unified dashboard on 10 more wikis (phase 2) - https://phabricator.wikimedia.org/T387820 [15:25:16] !log UTC afternoon backport+config window done [15:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:28] thanks Lucas_WMDE! 
[15:25:42] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:26:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10624292 (10MoritzMuehlenhoff) [15:26:42] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:27:55] (03CR) 10Ebernhardson: "In terms of rational, what the wcqs service needs is some sort of marker that says who is using the service, such that there is someone to" [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) (owner: 10Jelto) [15:28:43] (03PS21) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [15:30:05] brouberol: you can go ahead with the scap for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126574 [15:30:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) (owner: 10Zoe) [15:31:18] claime: ack thanks [15:31:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [15:32:01] !log brouberol@deploy2002 Started scap sync-world: mediawiki: render configmaps when dumps are enabled - T388378 [15:32:04] T388378: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378 [15:33:58] !log brouberol@deploy2002 Finished scap sync-world: mediawiki: render configmaps when dumps are enabled - T388378 (duration: 02m 18s) [15:34:43] (03PS4) 10Scott French: deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) 
[15:35:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10624351 (10elukey) >>! In T381274#10624204, @Jhancock.wm wrote: > @elukey do you need this server for any other testing? wanted to pass it along if not. Yes I ne... [15:36:20] (03CR) 10Effie Mouzeli: [C:03+1] deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [15:36:29] (03CR) 10Scott French: "FYI, I fixed one stray reference to mw-web in the description of the `--mediawiki_image` flag." [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [15:36:35] (03CR) 10Vgutierrez: haproxy: certificate check script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:53] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:37:03] (03PS6) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) [15:37:19] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:37:23] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10624358 (10jcrespo) >>! In T378922#10623769, @Jelto wrote: >> So that sounds promising that Bacula should be able to backup straig... 
[15:37:43] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [15:37:49] (03CR) 10Tjones: "The core dict is the right one for sure. But do we want to use the "latest" version? It gets updated a few times a year, so if we reload o" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) (owner: 10Ebernhardson) [15:38:57] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nicely done!" [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [15:39:51] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10624368 (10aborrero) [15:40:49] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10624373 (10MoritzMuehlenhoff) [15:45:12] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:45:22] (03CR) 10Slyngshede: "Sure, it's currently broken on thanos-be1005, but working on sretest1003." [puppet] - 10https://gerrit.wikimedia.org/r/1126542 (owner: 10Slyngshede) [15:45:32] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:46:29] (03CR) 10Vgutierrez: haproxy: certificate check script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [15:47:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:48:13] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [15:49:03] !log test liberica 0.10 in lvs1013 [15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:24] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:27] (03CR) 10Volans: [C:03+2] puppetdb: add support for structured facts [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [15:49:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10624466 (10phaultfinder) [15:49:54] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10624467 (10MoritzMuehlenhoff) 05Open→03Resolved I'm resolving the task given the NDA is sorted and the tracking has been up... [15:52:16] (03CR) 10Volans: [C:03+2] sre.deploy: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124399 (owner: 10Volans) [15:52:32] !log upload liberica 0.10 to bookworm-wikimedia (apt.wm.o) [15:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:44] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10624490 (10fnegri) a:05VRiley-WMF→03fnegri Claiming this to take care of the depool, I will reassign to @VRiley-WMF when this is safe to shut... 
[15:52:57] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10624493 (10fnegri) 05Open→03In progress [15:53:03] (03CR) 10David Caro: [V:03+1] "Done, it does not :)" [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [15:53:19] (03CR) 10David Caro: [V:03+1 C:03+2] ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [15:53:25] (03CR) 10Cathal Mooney: [C:03+2] Add sandbox firewall filter to sandbox vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1126543 (https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [15:53:35] (03PS4) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) [15:54:09] (03PS3) 10Hnowlan: switchdc: stop and restart crons as part of swithover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) [15:54:16] (03CR) 10David Caro: [C:03+2] ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [15:54:52] (03PS22) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [15:55:35] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM. Let me know if you need assistance when merging this, with the reprepro server." [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) (owner: 10FNegri) [15:55:40] (03CR) 10Scott French: "Thanks again for the review! I'll move ahead with this one, as it introduces no functional / content changes as is. 
The next patch will re" [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [15:55:42] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: add 'deploy' field to release config [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [15:56:05] (03PS23) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [15:56:21] (03CR) 10Fabfur: haproxy: certificate check script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [15:58:20] jouncebot: now [15:58:20] For the next 0 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1500) [15:58:24] jouncebot: nowandnext [15:58:24] For the next 0 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1500) [15:58:24] In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1600) [15:58:41] !log dzahn@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: phabricator deploy [15:58:49] !log dzahn@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: phabricator deploy [15:58:52] !log jelto@cumin1002 START - Cookbook sre.hosts.remove-downtime for kubestage[2001-2004].codfw.wmnet [15:58:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage[2001-2004].codfw.wmnet [15:58:59] (03Merged) 10jenkins-bot: Add sandbox firewall filter to sandbox vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1126543 
(https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [15:59:07] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phabricator deploy [15:59:15] !log jelto@cumin1002 START - Cookbook sre.hosts.remove-downtime for kubestagemaster[2003-2005].codfw.wmnet [15:59:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestagemaster[2003-2005].codfw.wmnet [15:59:25] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phabricator deploy [15:59:54] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:01] rzl: jhathaway: any objections if I use your window to merge a puppet patch that requires a coordinated scap deployment? [16:01:08] !log brennen@deploy2002 Started deploy [phabricator/deployment@714f3c7]: deploy phab2002 for T388551 [16:01:09] nope [16:01:12] T388551: Deploy Phabricator/Phorge 2025-03-11 - https://phabricator.wikimedia.org/T388551 [16:01:29] jhathaway: great, thank you [16:01:33] (03PS1) 10Ilias Sarantopoulos: ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) [16:01:37] !log brennen@deploy2002 Finished deploy [phabricator/deployment@714f3c7]: deploy phab2002 for T388551 (duration: 00m 29s) [16:01:54] hashar: did you plan to deploy? 
(I see your jouncebot query above ^) [16:01:56] !log brennen@deploy2002 Started deploy [phabricator/deployment@714f3c7]: deploy phab1004 for T388551 [16:02:15] (03CR) 10Elukey: [C:03+1] sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [16:02:59] (03CR) 10CI reject: [V:04-1] ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:02:59] !log brennen@deploy2002 Finished deploy [phabricator/deployment@714f3c7]: deploy phab1004 for T388551 (duration: 01m 02s) [16:04:25] (03CR) 10David Caro: [C:03+1] "LGTM, You'll need to manually remove the components though:" [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) (owner: 10FNegri) [16:04:41] (03CR) 10Scott French: "Thank you both for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:05:30] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T388561 (10TRIBU_sig) 03NEW Closing this task as invalid due to missing information. 
[16:05:45] (03PS2) 10Ilias Sarantopoulos: ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) [16:06:43] alright, going to move ahead [16:06:48] (03CR) 10Scott French: [C:03+2] hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:07:11] (03Merged) 10jenkins-bot: puppetdb: add support for structured facts [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [16:07:24] (03CR) 10CI reject: [V:04-1] ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:07:27] (03CR) 10Clément Goubert: "LGTM for `mw-cron`, will need special attention that CronJob objects and children are correctly deleted since it's the first time we do th" [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [16:07:33] (03CR) 10Clément Goubert: [C:03+1] switchdc: stop and restart crons as part of swithover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [16:08:14] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T388562 (10TRIBU_sig) 03NEW [16:09:43] (03PS3) 10Ilias Sarantopoulos: ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) [16:10:05] (03CR) 10Vgutierrez: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) 
[16:11:01] (03CR) 10Ssingh: cache: enable benthos on A:cp-text_ulsfo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:11:04] (03PS3) 10Ebernhardson: Add sudachi analyzer for japanese [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) [16:11:32] (03PS3) 10Fabfur: cache,haproxy: use parametrized tmpfiles cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1126517 (https://phabricator.wikimedia.org/T387826) [16:13:41] (03PS1) 10David Caro: cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) [16:14:04] (03CR) 10CI reject: [V:04-1] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:14:31] (03CR) 10Bking: cirrussearch: Add alerts for thread pool rejections (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:14:35] (03CR) 10FNegri: [C:03+2] "Ack, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) (owner: 10FNegri) [16:15:08] (03PS3) 10Fabfur: cache: enable benthos on A:cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) [16:15:29] (03CR) 10Ebernhardson: "Makes sense, i hadn't thought about the implications of the dictionary changing. 
I suppose that means if we ever change the dictionary in" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) (owner: 10Ebernhardson) [16:15:37] (03PS4) 10Fabfur: cache: enable benthos on A:cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) [16:15:55] (03CR) 10Klausman: [C:03+1] ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:15:56] (03CR) 10Fabfur: cache: enable benthos on A:cp-text_ulsfo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:16:07] !log swfrench@deploy2002 Started scap sync-world: No-op deploy to pick up mediawiki-deployments.yaml changes - T387917 [16:16:11] T387917: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917 [16:17:00] (03CR) 10Dzahn: [C:03+2] Phabricator: Add "video/mp4" to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) (owner: 10Aklapper) [16:17:02] (03PS2) 10David Caro: cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) [16:17:03] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:17:11] (03CR) 10Ebernhardson: "Do we have a plan for the runbook here? It seems like this is one of the rare (shouldn't be so rare, but such is life...) 
cases where we d" [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:17:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:17:37] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T388562#10624721 (10Aklapper) 05Open→03Declined Declining as there is no Wikimedia Affiliate supporting. Please see https://wikitech.wikimedia.org/wiki/Maps/External_usage : //maps.wikimedia.org tile... [16:18:03] (03PS3) 10David Caro: cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) [16:18:16] !log swfrench@deploy2002 Finished scap sync-world: No-op deploy to pick up mediawiki-deployments.yaml changes - T387917 (duration: 02m 42s) [16:18:27] (03Merged) 10jenkins-bot: ml-services: remove old ref-quality deployment and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126592 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:19:40] (03CR) 10Scott French: "Thank you both for the review! The release values are live, so this is safe to merge now." [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:19:43] (03CR) 10Scott French: [C:03+2] deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:19:55] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[16:20:10] (03CR) 10CI reject: [V:04-1] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:20:37] (03CR) 10Ssingh: [C:03+1] "Looks good, comparing against the upload changes!" [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:21:14] (03CR) 10Fabfur: [C:03+2] cache: enable benthos on A:cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:21:32] (03PS4) 10David Caro: cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) [16:22:04] (03PS1) 10Aklapper: Phabricator: Fix typo in files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1126599 (https://phabricator.wikimedia.org/T309222) [16:22:26] (03CR) 10Dzahn: [C:03+2] Phabricator: Fix typo in files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1126599 (https://phabricator.wikimedia.org/T309222) (owner: 10Aklapper) [16:22:28] (03CR) 10Dzahn: [V:03+2 C:03+2] Phabricator: Fix typo in files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1126599 (https://phabricator.wikimedia.org/T309222) (owner: 10Aklapper) [16:22:38] (03CR) 10Giuseppe Lavagetto: mediawiki: introduce feature flags (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [16:23:27] (03PS3) 10Jsn.sherman: Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) [16:23:45] (03CR) 10CI reject: [V:04-1] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:24:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake 
disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10624760 (10elukey) >>! In T384003#10623146, @MatthewVernon wrote: > I guess one thing to do might be to do some I/O workload during a... [16:24:12] (03PS5) 10David Caro: cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) [16:24:28] (03PS8) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [16:24:29] (03PS10) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [16:24:29] (03PS10) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [16:24:33] (03PS13) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [16:25:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10624764 (10Jhancock.wm) [16:25:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T371742)', diff saved to https://phabricator.wikimedia.org/P74197 and previous config saved to /var/cache/conftool/dbconfig/20250311-162530-ladsgroup.json [16:25:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:26:44] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:27:22] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [16:27:31] (03CR) 
10FNegri: [C:03+2] "That's what I did after merging:" [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) (owner: 10FNegri) [16:28:43] (03PS1) 10Fabfur: Revert "cache: enable benthos on A:cp-text_ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1126601 [16:29:11] (03CR) 10Fabfur: [C:03+2] Revert "cache: enable benthos on A:cp-text_ulsfo" [puppet] - 10https://gerrit.wikimedia.org/r/1126601 (owner: 10Fabfur) [16:29:24] (03CR) 10David Caro: [V:03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:30:04] (03PS1) 10Clément Goubert: mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) [16:30:08] (03CR) 10Hnowlan: [C:03+1] benthos-mw-accesslog-metrics: create deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [16:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:32:44] (03PS2) 10Clément Goubert: mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) [16:32:47] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [16:32:53] !log brennen@deploy2002 Started deploy [phabricator/deployment@714f3c7]: redeploy phab2002 for T309222 [16:33:00] T309222: Unable to preview MP4 video in Phabricator task comments and descriptions - https://phabricator.wikimedia.org/T309222 [16:33:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:33:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:33:56] !log brennen@deploy2002 Finished deploy [phabricator/deployment@714f3c7]: redeploy phab2002 for T309222 (duration: 01m 03s) [16:34:21] !log brennen@deploy2002 Started deploy [phabricator/deployment@714f3c7]: redeploy phab1004 for T309222 [16:34:46] fabfur: any relation to "Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null" ? [16:35:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:36:01] !log brennen@deploy2002 Finished deploy [phabricator/deployment@714f3c7]: redeploy phab1004 for T309222 (duration: 01m 40s) [16:37:21] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10624853 (10AStein-WMF) @MoritzMuehlenhoff my user is `astein`! 
[16:40:02] jouncebot: nowandnext [16:40:02] For the next 0 hour(s) and 19 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1600) [16:40:02] In 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1700) [16:40:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P74198 and previous config saved to /var/cache/conftool/dbconfig/20250311-164038-ladsgroup.json [16:44:59] since no deployments are ongoing, I'm going to start some potentially long-running work for the upcoming infra window a bit early [16:45:18] (03CR) 10Scott French: "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:47:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10624887 (10Neobeta61) Thanks for the assistance and conversations guys. Buona Fortuna con tutto il resto! [16:47:48] (03CR) 10Scott French: [V:03+2] "Built and verified to contain the correct PCRE2 locally." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:47:50] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: Install PCRE2 backport from component/php81 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:50:47] (03PS1) 10Fabfur: hiera: fix previous commit on ulsfo text hiera [puppet] - 10https://gerrit.wikimedia.org/r/1126605 (https://phabricator.wikimedia.org/T329332) [16:51:10] !log herron@cumin1002 START - Cookbook sre.dns.netbox [16:51:22] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126605 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:52:56] !log upload liberica 0.11 to bookworm-wikimedia (apt.wm.o) [16:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:38] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:40] !log test liberica 0.11 in lvs1013 [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:41] !log rebuilt php8.1 production images to pick up PCRE2 backport from component/php81 - T386006 [16:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [16:55:13] starting scap to pick up the above production image change shortly [16:55:25] (03CR) 10Bking: [C:04-1] "Good catch, I started to write it at https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Thread_pools . 
I'll finish th" [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:55:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P74199 and previous config saved to /var/cache/conftool/dbconfig/20250311-165545-ladsgroup.json [16:58:24] !log swfrench@deploy2002 Started scap sync-world: Deployment to pick up new php8.1 production image - T386006 [16:58:49] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1011* for ban host prior to reimage - bking@cumin2002 - T387904 [16:58:52] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [16:58:52] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1011* for ban host prior to reimage - bking@cumin2002 - T387904 [17:00:04] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1700). [17:00:10] o/ [17:00:34] already in progress, though slightly inverted from the ordering in the deployment calendar [17:00:35] (03PS2) 10Fabfur: hiera: fix previous commit on ulsfo text hiera [puppet] - 10https://gerrit.wikimedia.org/r/1126605 (https://phabricator.wikimedia.org/T329332) [17:00:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:01:31] (03CR) 10Ssingh: [C:03+1] "We should paste the diff here between the broken commit for posterity but looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126605 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [17:03:13] (03CR) 10Fabfur: [C:03+2] hiera: fix previous commit on ulsfo text hiera [puppet] - 10https://gerrit.wikimedia.org/r/1126605 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [17:04:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10624975 (10MatthewVernon) Sorry, still seeing errors about two of the drives: ` mvernon@ms-be2075:~$ grep 'Power-on' /var/log/kern.log | cut -f 8 -d ' ' | sort | uniq -c 648... [17:07:06] (03PS10) 10Vgutierrez: sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) [17:10:15] (03PS2) 10Hnowlan: httpbb: use k8s jobrunners for healthchecking [puppet] - 10https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) [17:10:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T371742)', diff saved to https://phabricator.wikimedia.org/P74200 and previous config saved to /var/cache/conftool/dbconfig/20250311-171052-ladsgroup.json [17:10:57] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:11:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4051.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [17:12:33] (03CR) 10Effie Mouzeli: hieradata: switch all releases of mw-(apt-ext|web) to 8.1 (1 of 2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:14:45] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4051.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [17:15:36] (03CR) 10Effie Mouzeli: 
[C:03+1] mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:18:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [17:18:40] (03PS9) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [17:18:40] (03PS11) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [17:18:40] (03PS11) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [17:18:41] (03PS14) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [17:21:13] (03PS1) 10Effie Mouzeli: hieradata: switch mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [17:21:19] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [17:22:40] (03CR) 10BCornwall: [C:03+1] create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:22:46] (03CR) 10BCornwall: [C:03+1] add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:23:43] (03PS2) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all traffic to PHP8.1 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [17:24:18] (03CR) 10Kamila Součková: [C:03+1] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:24:19] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [17:24:22] !log swfrench@deploy2002 Finished scap sync-world: Deployment to pick up new php8.1 production image - T386006 (duration: 26m 26s) [17:24:25] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [17:24:26] (03CR) 10BCornwall: [C:03+2] upgrade-varnish: Remove vmods/varnish explicitly [cookbooks] - 10https://gerrit.wikimedia.org/r/1126151 (owner: 10BCornwall) [17:26:55] !nowandnext [17:27:05] (03CR) 10BCornwall: [C:03+1] create codesearch.wikimedia.org, point to standard DYNA [dns] - 10https://gerrit.wikimedia.org/r/1126176 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:27:10] jouncebot: nowandnext [17:27:10] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1700) [17:27:10] In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1800) [17:27:23] (03CR) 10JMeybohm: [C:03+2] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:28:08] kamila_: I'm about to deploy the second planned change of the infra window [17:28:25] swfrench-wmf: ok, thanks :-) I was just typing up the question :D [17:28:54] (03CR) 10BCornwall: [C:03+1] add k8s ingress service aliases for jaeger in codfw [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) 
(owner: 10Dzahn) [17:29:00] kamila_: is it something urgent? if so, I we can try to coordinate [17:29:07] no no [17:29:11] s/I we/we/ :) [17:29:12] I'm just bored :D [17:29:20] well I'm not, but I'm not in a hurry [17:29:45] (arguably I shouldn't be deploying in my evening anyway :D) [17:30:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:30:45] hehe, sounds good [17:31:00] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:31:01] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 50% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:32:27] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 50% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:34:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:34:31] (03Merged) 10jenkins-bot: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:34:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:35:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:35:30] (03CR) 10FNegri: [C:03+1] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [17:35:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply 
[17:35:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage [17:37:10] (03CR) 10Mvolz: [C:03+1] "QA approved group 1." [puppet] - 10https://gerrit.wikimedia.org/r/1125461 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [17:38:08] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:38:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:38:42] (03PS2) 10Effie Mouzeli: hieradata: switch mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [17:38:51] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:39:04] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:39:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage [17:43:02] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:43:16] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:43:32] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:43:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:43:58] (03PS3) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [17:45:59] (03PS3) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [17:46:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:46:23] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [17:46:32] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:46:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:47:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10625346 (10BCornwall) 05Stalled→03Resolved Hi, @YLiou_WMF, I'm going to close this... [17:48:11] !log mw-(api-ext|web): migrated 50% of residual PHP 7.4 traffic to 8.1 - T383845 [17:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:15] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:49:26] alright, I am now done with the window [17:54:37] (03CR) 10RLazarus: switchdc: stop and restart crons as part of swithover process (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:58:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1011.eqiad.wmnet with OS bullseye [17:59:18] (03CR) 10Andrea Denisse: [C:03+2] alert: Remove stale vops-bot-sync-db* service [puppet] - 10https://gerrit.wikimedia.org/r/1126128 (https://phabricator.wikimedia.org/T388444) (owner: 10Andrea Denisse) [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1800) [18:04:12] (03PS1) 10David Caro: clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) [18:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:11:31] (03PS1) 10Andrea Denisse: Revert "alert: Remove stale vops-bot-sync-db* service" [puppet] - 10https://gerrit.wikimedia.org/r/1126620 [18:11:32] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126621 (https://phabricator.wikimedia.org/T386215) [18:11:34] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126621 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:12:24] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126621 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:12:26] (03CR) 10Andrea Denisse: [C:03+2] Revert "alert: Remove stale vops-bot-sync-db* service" [puppet] - 10https://gerrit.wikimedia.org/r/1126620 (owner: 10Andrea Denisse) [18:13:23] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10625540 (10BCornwall) [18:13:41] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10625542 (10BCornwall) a:03MoritzMuehlenhoff 
[18:14:04] (03PS2) 10Bking: cloudelastic: migrate cloudelastic1011 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125229 [18:16:15] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw enable bgp - https://phabricator.wikimedia.org/T388586 (10herron) 03NEW [18:16:49] (03PS1) 10Herron: homer: aux-k8s-codfw: add ASN [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) [18:17:17] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1011 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125229 (owner: 10Bking) [18:19:04] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw enable bgp - https://phabricator.wikimedia.org/T388586#10625606 (10herron) Set BGP back to false in netbox for aux-k8s-worker2* for the time being [18:19:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [18:19:31] FIRING: Device rebooted: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:19:40] (03CR) 10Scott French: [C:03+1] "Ah, good question. 
No objection on first principles, but I also have no idea how relevant these are in practice for the MediaWiki use case" [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [18:24:31] RESOLVED: Device rebooted: Device ps1-a2-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:25:01] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.20 refs T386215 [18:25:04] T386215: 1.44.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T386215 [18:25:52] (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:25:55] (03CR) 10BCornwall: [C:03+2] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:30:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage [18:32:49] (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [18:34:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1011.eqiad.wmnet with reason: host reimage [18:35:57] (03CR) 10Tjones: [C:03+2] Add sudachi analyzer for japanese [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) (owner: 10Ebernhardson) [18:37:01] (03PS5) 10Abijeet Patro: AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) [18:41:07] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10625762 (10Jhancock.wm) disk has been replaced. all alerts i can see have cleared. please let us know if everything looks good on your end. return tracking: 286282047358 [18:46:31] (03CR) 10Scott French: [C:03+1] "Thanks for following up with the puppet patch! That LGTM, along with the update to the task description outlining the procedure." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [18:49:48] (03CR) 10Tjones: [V:03+2 C:03+2] Add sudachi analyzer for japanese [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) (owner: 10Ebernhardson) [18:58:10] (03CR) 10Cathal Mooney: [C:03+1] "These changes are fine, however there are a few other bits we also need to add for this to work, so please don't merge I'll take a look in" [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) (owner: 10Herron) [18:59:39] (03CR) 10Herron: "Great, thanks for the quick reply! 
I'll hang tight" [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) (owner: 10Herron) [19:00:41] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1011.eqiad.wmnet with OS bullseye [19:02:00] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw enable bgp - https://phabricator.wikimedia.org/T388586#10625895 (10cmooney) Hey @herron thanks for the patch, we do definitely need that. We'll also need to define the policy for th... [19:04:15] (03PS4) 10Scott French: mw-(api-ext|web): serve 100% of residual traffic on 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) [19:04:15] (03PS4) 10Scott French: mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) [19:04:15] (03PS5) 10Scott French: mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) [19:04:16] (03PS6) 10Scott French: mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) [19:04:23] !log bking@cumin2002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for cloudelastic1011.eqiad.wmnet: Renew puppet certificate - bking@cumin2002 [19:05:43] jouncebot: nowandnext [19:05:43] For the next 0 hour(s) and 54 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T1800) [19:05:43] In 0 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T2000) [19:06:03] (03PS1) 10Gergő Tisza: Silence TRX profiler in deferreds after autocreation 
[core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126628 (https://phabricator.wikimedia.org/T388165) [19:07:14] (03PS2) 10Scott French: hieradata: switch all releases of mw-(api-ext|web) to 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) [19:07:14] (03PS2) 10Scott French: trafficserver: revert cookie-enrolled traffic to main [puppet] - 10https://gerrit.wikimedia.org/r/1125502 (https://phabricator.wikimedia.org/T383845) [19:08:07] (03CR) 10Scott French: "Thanks for the review, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:09:10] (03PS1) 10Gergő Tisza: Silence TRX profiler in deferreds after autocreation [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126630 (https://phabricator.wikimedia.org/T388165) [19:09:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126628 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [19:09:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126630 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [19:11:49] (03PS1) 10Reedy: api: guard against undefined prop relations [extensions/LiquidThreads] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126633 (https://phabricator.wikimedia.org/T384627) [19:11:58] (03PS1) 10Reedy: api: guard against undefined prop relations [extensions/LiquidThreads] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126634 (https://phabricator.wikimedia.org/T384627) [19:12:11] jeena: ^^ If you want to squash most of that 
logspam [19:12:31] I'd offer to deploy them now, but I'm about to head out for a bit [19:16:53] 👍 [19:21:33] (03PS1) 10Scott French: package_builder: remove pbuilder hook for pcre2 component (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125537 (https://phabricator.wikimedia.org/T386006) [19:21:33] (03CR) 10Scott French: "If you might be able to review this and the next patch, that would be greatly appreciated. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1125537 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [19:21:35] (03PS1) 10Scott French: package_builder: remove pbuilder hook for pcre2 component (2 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125538 (https://phabricator.wikimedia.org/T386006) [19:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10625951 (10phaultfinder) [19:25:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/LiquidThreads] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126634 (https://phabricator.wikimedia.org/T384627) (owner: 10Reedy) [19:25:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/LiquidThreads] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126633 (https://phabricator.wikimedia.org/T384627) (owner: 10Reedy) [19:26:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but see inline comment for a future, simpler approach." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125537 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [19:26:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125538 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [19:26:32] (03Merged) 10jenkins-bot: api: guard against undefined prop relations [extensions/LiquidThreads] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126634 (https://phabricator.wikimedia.org/T384627) (owner: 10Reedy) [19:26:41] (03Merged) 10jenkins-bot: api: guard against undefined prop relations [extensions/LiquidThreads] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126633 (https://phabricator.wikimedia.org/T384627) (owner: 10Reedy) [19:27:17] (03PS1) 10Muehlenhoff: aptrepo: remove component/pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [19:27:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM (although it's also fine to simply keep it, we usually keep unused components around until we remove the entire FOO-wikimedia suite." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [19:27:17] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1126634|api: guard against undefined prop relations (T384627)]], [[gerrit:1126633|api: guard against undefined prop relations (T384627)]] [19:27:21] T384627: LiquidThreads: PHP Notice: Undefined index: root / PHP Warning: Undefined array key "root" - https://phabricator.wikimedia.org/T384627 [19:30:14] !log jhuneidi@deploy2002 reedy, jhuneidi: Backport for [[gerrit:1126634|api: guard against undefined prop relations (T384627)]], [[gerrit:1126633|api: guard against undefined prop relations (T384627)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:30:43] !log jhuneidi@deploy2002 reedy, jhuneidi: Continuing with sync [19:30:56] RECOVERY - Dell PowerEdge RAID Controller on db2211 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [19:37:11] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126634|api: guard against undefined prop relations (T384627)]], [[gerrit:1126633|api: guard against undefined prop relations (T384627)]] (duration: 09m 53s) [19:37:14] T384627: LiquidThreads: PHP Notice: Undefined index: root / PHP Warning: Undefined array key "root" - https://phabricator.wikimedia.org/T384627 [19:43:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:45] (03PS1) 10Bking: cirrus: Ensure opensearch rundirs are created [puppet] - 10https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) [19:46:01] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [19:47:23] (03CR) 10JHathaway: sqlite: require sqlite::package in 'file' db resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi) [19:51:13] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [19:51:21] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [19:53:16] (03PS1) 10BCornwall: varnish: Don't crash slowlog if tag has no value [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) [19:53:18] (03PS1) 10BCornwall: varnish: add log filters to slowquery logs [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) [19:53:38] (03CR) 10CI reject: [V:04-1] varnish: Don't crash slowlog if tag has no value [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [19:53:49] (03CR) 10CI reject: [V:04-1] varnish: add log filters to slowquery logs [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:56:34] (03PS1) 10Gergő Tisza: Enable SUL3 signup for 50% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) [19:57:32] (03CR) 10BCornwall: "Example output:" [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:57:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [19:59:39] (03PS2) 
10Bking: cirrus: Ensure opensearch rundirs are created [puppet] - 10https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T2000). [20:00:05] kimberly_sarabia and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:37] hey im here [20:02:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10626018 (10Jhancock.wm) [20:02:32] o/ [20:03:21] Hi, I can backport [20:04:07] ty [20:04:13] kimberly_sarabia: I'll start with your patch [20:04:24] ok im ready [20:04:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [20:05:40] (03Merged) 10jenkins-bot: Deploy donate banner to test wiki for event logging testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [20:06:10] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1126140|Deploy donate banner to test wiki for event logging testing (T387768)]] [20:06:15] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:08:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:09:08] !log jhuneidi@deploy2002 ksarabia, jhuneidi: Backport for [[gerrit:1126140|Deploy donate banner to test wiki for event logging testing (T387768)]] 
synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:43] Lookin good@ [20:10:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1012* for ban host prior to reimage - bking@cumin2002 - T387904 [20:10:50] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [20:10:52] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1012* for ban host prior to reimage - bking@cumin2002 - T387904 [20:12:13] thanks kimberly_sarabia [20:12:20] !log jhuneidi@deploy2002 ksarabia, jhuneidi: Continuing with sync [20:12:35] thanks! [20:13:56] (03CR) 10Scott French: "Thank you, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1125537 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:14:57] (03CR) 10Scott French: [C:03+2] package_builder: remove pbuilder hook for pcre2 component (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125537 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:16:56] (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) [20:17:54] (03PS4) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [20:18:44] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126140|Deploy donate banner to test wiki for event logging testing (T387768)]] (duration: 12m 33s) [20:18:48] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:19:40] tgr_: ready? 
[20:20:09] yes [20:20:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126628 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [20:20:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126630 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [20:21:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [20:21:57] (03PS2) 10Herron: add aux-k8s-codfw to environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 (https://phabricator.wikimedia.org/T381417) [20:25:06] isaranto: I guess these liftwing error rates are still a thing? [20:25:10] ^^^ [20:26:25] FYI it's 18 hours since s.wfrench-wmf added an 18-hour silence [20:26:34] was just about to say :) [20:26:36] so the error rate is still a thing but nothing changed just now except the silence expiring [20:26:55] `Expired 9 minutes ago` [20:27:10] I can create it again if desired (I have it open) [20:27:15] urandom: yes. We fixed one of the services ( reference-risk) but the other one still needs work [20:27:17] ah, u.random just did :) [20:27:18] swfrench-wmf: I just did [20:28:06] Can we add the silence for one more day? I will make sure to follow up and do sth either for the service or the alert :) [20:28:17] Thank you folks and sorry for the noise [20:28:24] one more meaning one day from now? 
[20:28:34] (that's what I did, fwiw) [20:32:55] (03Merged) 10jenkins-bot: Silence TRX profiler in deferreds after autocreation [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126628 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [20:35:25] (03Merged) 10jenkins-bot: Silence TRX profiler in deferreds after autocreation [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126630 (https://phabricator.wikimedia.org/T388165) (owner: 10Gergő Tisza) [20:35:56] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1126628|Silence TRX profiler in deferreds after autocreation (T388165)]], [[gerrit:1126630|Silence TRX profiler in deferreds after autocreation (T388165)]] [20:35:59] T388165: "Expectation not met" warnings during SUL autologin autocreation - https://phabricator.wikimedia.org/T388165 [20:37:16] (03PS1) 10Ebernhardson: Ensure opensearch package is installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) [20:37:38] (03CR) 10CI reject: [V:04-1] Ensure opensearch package is installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [20:38:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [20:38:58] (03PS2) 10Ebernhardson: Ensure opensearch package is installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) [20:39:04] !log jhuneidi@deploy2002 jhuneidi, tgr: Backport for [[gerrit:1126628|Silence TRX profiler in deferreds after autocreation (T388165)]], [[gerrit:1126630|Silence TRX profiler in deferreds after autocreation (T388165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:39:17] (03PS3) 10Ebernhardson: Require opensearch package to be 
installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) [20:39:25] tgr_: ready for you to check on mwdebug [20:41:20] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [20:42:22] jeena: looks good [20:42:28] !log jhuneidi@deploy2002 jhuneidi, tgr: Continuing with sync [20:42:31] thanks! [20:47:38] (03PS1) 10Brouberol: airflow: fix datahub connection host values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126655 (https://phabricator.wikimedia.org/T386282) [20:47:42] (03PS2) 10BCornwall: varnish: Don't crash slowlog if tag has no value [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) [20:47:42] (03PS2) 10BCornwall: varnish: add log filters to slowquery logs [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) [20:49:02] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126628|Silence TRX profiler in deferreds after autocreation (T388165)]], [[gerrit:1126630|Silence TRX profiler in deferreds after autocreation (T388165)]] (duration: 13m 05s) [20:49:06] T388165: "Expectation not met" warnings during SUL autologin autocreation - https://phabricator.wikimedia.org/T388165 [20:49:55] thanks for the deploy! [20:50:40] (03CR) 10Dzahn: [C:03+2] create codesearch.wikimedia.org, point to standard DYNA [dns] - 10https://gerrit.wikimedia.org/r/1126176 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:50:45] !log dzahn@dns1004 START - running authdns-update [20:51:53] you're welcome! [20:52:16] (03CR) 10Bking: [C:03+1] "Looks good to me, but would like to have someone from 0lly to take a look before we merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [20:52:54] !log dzahn@dns1004 END - running authdns-update [20:56:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:57:14] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T2100) [21:04:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:05:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:09:30] (03PS4) 10Ryan Kemper: Require opensearch package to be installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [21:13:56] (03CR) 10Scott French: [C:03+2] package_builder: remove pbuilder hook for pcre2 component (2 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125538 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [21:14:26] (03PS2) 10Scott French: package_builder: remove pbuilder hook for pcre2 component (2 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125538 (https://phabricator.wikimedia.org/T386006) [21:14:37] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, tho I'd wait for Cole's feedback." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [21:15:56] jouncebot: nowandnext [21:15:56] For the next 0 hour(s) and 44 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250311T2100) [21:15:56] In 8 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0600) [21:15:58] (03CR) 10Scott French: [C:03+2] package_builder: remove pbuilder hook for pcre2 component (2 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125538 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [21:16:43] (03PS1) 10Jforrester: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 [21:16:43] (03PS1) 10Jforrester: [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) [21:16:44] (03PS1) 10Jforrester: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) [21:16:46] (03PS1) 10Jforrester: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) [21:17:26] (03CR) 10CI reject: [V:04-1] Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 (owner: 10Jforrester) [21:17:26] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5054/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [21:17:35] (03CR) 10Federico Ceratto: 
[C:03+1] Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [21:17:37] (03CR) 10Federico Ceratto: [C:03+2] Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [21:18:50] (03PS2) 10Jforrester: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 [21:18:50] (03PS2) 10Jforrester: [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) [21:18:50] (03PS2) 10Jforrester: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) [21:18:51] (03PS2) 10Jforrester: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) [21:19:34] (03CR) 10CI reject: [V:04-1] Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 (owner: 10Jforrester) [21:20:49] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5055/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:23:51] (03CR) 10Jforrester: [C:03+1] CommonSettings.php: Remove reference to scandium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126156 (owner: 10Subramanya Sastry) [21:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626467 
(phaultfinder)
[21:24:49] (Merged) jenkins-bot: Implement Icinga notification check before pooling in a host [cookbooks] - https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: Federico Ceratto)
[21:31:24] (PS1) Ryan Kemper: Bump changelog version for sudachi analyzer [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868)
[21:38:11] FIRING: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[21:39:35] (CR) Bking: [C:+1] Bump changelog version for sudachi analyzer [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868) (owner: Ryan Kemper)
[21:39:38] (CR) Ebernhardson: [C:+2] Bump changelog version for sudachi analyzer [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868) (owner: Ryan Kemper)
[21:40:12] (PS5) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[21:40:15] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10626536 (Papaul) lsw2-e8-codfw transceiver and cable type ` --{ + state }--[ interface ethernet-1/10 ]-- transceiver {...
[21:41:57] (PS5) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[21:41:57] (CR) Effie Mouzeli: "needs to be consolidated with Ib36d255010f67779f0ac0a98d07d60d5c7453845 before being ready for review" [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[21:42:36] (CR) BCornwall: "Acknowledged" [dns] - https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[21:43:11] RESOLVED: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[21:45:25] (PS2) Ryan Kemper: Bump changelog version for sudachi analyzer [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868)
[21:53:14] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:53:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:54:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626578 (phaultfinder)
[21:58:16] (CR) Scott French: [C:+1] "Thanks, Effie! This LGTM together with I03fea17c397abd0f03350cd636fbeebb40bc33f9.
Also, the procedure described in the commit message look" [deployment-charts] - https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[21:59:03] (PS6) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[21:59:41] !log reedy@deploy2002 Synchronized private/: various cleanup (duration: 08m 45s)
[22:01:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:01:27] (PS2) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845)
[22:01:48] (PS7) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[22:01:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:02:00] (CR) Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[22:04:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626620 (phaultfinder)
[22:06:53] (CR) JHathaway: puppetserver: add option to manage git permissions with an acl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[22:07:21] (PS3) JHathaway:
puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995)
[22:08:48] (CR) Scott French: "Thanks, Effie!" [deployment-charts] - https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[22:09:07] (CR) JHathaway: "@ltoscano@wikimedia.org updated patch with removal logic added. If this looks okay, my thought is to use per host hiera settings to role t" [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[22:09:35] (CR) CI reject: [V:-1] puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[22:10:45] (CR) Ebernhardson: [C:-1] "As discussed, i think Iecbb4bd28e is a better approach to the same problem" [puppet] - https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) (owner: Bking)
[22:11:32] (CR) Scott French: [C:+1] "Still looks good! One commit-message nit now that the deployment-charts patch has been split." [puppet] - https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[22:15:19] (PS4) JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995)
[22:16:08] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10626638 (Dzahn) @Urbanecm What I...
[22:16:14] (CR) Scott French: [C:+1] "Thank you for splitting this up! Updated patches and procedure in the commit message LGTM."
[deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[22:17:27] (CR) CI reject: [V:-1] puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[22:27:51] (PS2) JHathaway: puppet: add an ACL puppet module [puppet] - https://gerrit.wikimedia.org/r/1125245 (https://phabricator.wikimedia.org/T385995)
[22:27:51] (PS5) JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995)
[22:30:26] (CR) CI reject: [V:-1] puppet: add an ACL puppet module [puppet] - https://gerrit.wikimedia.org/r/1125245 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[22:31:57] (PS3) JHathaway: puppet: add an ACL puppet module [puppet] - https://gerrit.wikimedia.org/r/1125245 (https://phabricator.wikimedia.org/T385995)
[22:31:57] (PS6) JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995)
[22:36:51] (PS1) Dzahn: lists/stewards: allow switching some lists to actual "wet"-run [puppet] - https://gerrit.wikimedia.org/r/1126676 (https://phabricator.wikimedia.org/T388354)
[23:05:13] (CR) Cwhite: [C:+1] "PCC seems happy and we'll only know if it works when we do new provisioning."
[puppet] - https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: Ebernhardson)
[23:08:08] (PS1) Reedy: CommonSettings.php: Set virtual-bouncehandler domain mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/1126678
[23:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626840 (phaultfinder)
[23:44:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626850 (phaultfinder)
[23:59:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626870 (phaultfinder)