[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0000)
[00:01:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1110.eqiad.wmnet with OS bullseye
[00:01:11] (PS1) Bking: elastic: bring repurposed hosts back into elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782)
[00:01:31] (PS1) Dwisehaupt: community_civicrm: add stub for dovecot_passwd [labs/private] - https://gerrit.wikimedia.org/r/1124204 (https://phabricator.wikimedia.org/T383715)
[00:01:35] (CR) CI reject: [V:-1] elastic: bring repurposed hosts back into elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[00:02:32] (PS2) Bking: elastic: bring repurposed hosts back into elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782)
[00:02:45] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[00:04:32] (CR) Arlolra: [C:-1] "pfragments were reverted, we can't deploy this yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[00:04:52] (PS1) Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715)
[00:05:13] (CR) CI reject: [V:-1] community_civicrm: dovecot module for serving up local mail [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[00:06:49] (PS3) Bking: elastic: bring repurposed hosts back into elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782)
[00:06:57] (PS2) Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715)
[00:08:06] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[00:11:39] (CR) Ryan Kemper: [C:+1] elastic: bring repurposed hosts back into elastic clusters [puppet] - https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[00:12:49] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598919 (thcipriani) I know we use some of these images to build subsequent images. It //may be// that the only images we need to keep are the ones listed i...
[00:30:05] (PS2) Scott French: php8.1: Default display_startup_errors to "stderr" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038)
[00:34:32] !log deleting older mw-multiversion images on deploy2002 to free space (T387796)
[00:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:35] T387796: deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796
[00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124208
[00:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124208 (owner: TrainBranchBot)
[00:38:41] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598982 (Papaul) @Jhancock.wm yes both nodes are sending the request to the old puppet server ` "backup2013.codfw.wmnet" (SHA256)
58:41:--------60:EA:EA:C5:D0:5C:CF:16:25:DD...
[00:39:16] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[00:41:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:42:40] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598993 (dduvall) I've removed a bunch of old images for now. I will talk with Tyler and Ahmon tomorrow about long term solutions. For future reference, sc...
[00:43:34] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598994 (dduvall) Even after the deletion, there is still a lot of reclaimable space: ` dduvall@deploy2002:~$ docker system df TYPE TOTAL AC...
[00:47:05] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10599005 (thcipriani) >>! In T387796#10598965, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.o...
[00:48:21] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops
[00:50:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002"
[00:50:15] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124208 (owner: TrainBranchBot)
[00:50:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002"
[00:50:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:52:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[00:53:21] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[00:56:04] (PS1) Scott French: mw-api-int: serve 25% of traffic on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1124210 (https://phabricator.wikimedia.org/T383845)
[00:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:59:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-8 to codfw - jhancock@cumin2002"
[00:59:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-8 to codfw - jhancock@cumin2002"
[00:59:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:08:32] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124213
[01:08:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124213 (owner: TrainBranchBot)
[01:09:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10599069 (phaultfinder)
[01:16:11] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 245, active_shards: 427, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 59, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_m
[01:16:11] ng_in_queue_millis: 0, active_shards_percent_as_number: 87.32106339468302 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:16:11] RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 245, active_shards: 427, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 59, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_f
[01:16:11] tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.32106339468302 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire -
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10599081 (phaultfinder)
[01:32:15] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124213 (owner: TrainBranchBot)
[01:41:28] (PS1) BryanDavis: ats-tls: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794)
[01:42:29] SRE, Traffic, WikimediaDebug, Developer Productivity, Patch-For-Review: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794#10599124 (bd808) Open→In progress a:bd808
[01:43:38] (CR) CI reject: [V:-1] ats-tls: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794) (owner: BryanDavis)
[01:47:29] (PS2) BryanDavis: ats-tls: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - https://gerrit.wikimedia.org/r/1124216 (https://phabricator.wikimedia.org/T305794)
[01:58:59] (PS3) Scott French: php8.1: Set display_startup_errors consistent with display_errors [docker-images/production-images] - https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038)
[02:01:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:08:15] (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.19 [core] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124219 (https://phabricator.wikimedia.org/T386214)
[02:08:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.19 [core] (wmf/1.44.0-wmf.19) -
https://gerrit.wikimedia.org/r/1124219 (https://phabricator.wikimedia.org/T386214) (owner: TrainBranchBot)
[02:18:24] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387811 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[02:18:36] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387811 (ops-monitoring-bot) NEW
[02:20:52] (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.19 [core] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124219 (https://phabricator.wikimedia.org/T386214) (owner: TrainBranchBot)
[02:24:11] (CR) Scott French: [C:-1] "I did get a brief chance to think on this more today, and really either of these approaches would work." [deployment-charts] - https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: Hnowlan)
[02:28:42] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10599156 (VRiley-WMF)
[02:31:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0300)
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:57:26] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10599239 (Dzahn) p:High→Medium
[04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0400)
[04:01:35] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124226 (https://phabricator.wikimedia.org/T386214)
[04:01:36] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124226 (https://phabricator.wikimedia.org/T386214) (owner: TrainBranchBot)
[04:02:23] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124226 (https://phabricator.wikimedia.org/T386214) (owner: TrainBranchBot)
[04:02:48] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.19 refs T386214
[04:02:51] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214
[04:20:39] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[04:30:39] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok.
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[04:31:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:52:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[04:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0500)
[05:01:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:05:57] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.16 (duration: 05m 56s)
[05:14:35] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO -
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:30:25] (CR) KartikMistry: [C:+2] Update cxserver to 2025-03-03-041049-production [deployment-charts] - https://gerrit.wikimedia.org/r/1123940 (https://phabricator.wikimedia.org/T369815) (owner: KartikMistry)
[05:31:10] Minor cxserver deployment ^
[05:32:03] (Merged) jenkins-bot: Update cxserver to 2025-03-03-041049-production [deployment-charts] - https://gerrit.wikimedia.org/r/1123940 (https://phabricator.wikimedia.org/T369815) (owner: KartikMistry)
[05:39:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:43:25] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:43:52] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:50:44] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:51:16] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:51:34] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:52:09] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:53:43] !log Updated cxserver to 2025-03-03-041049-production (T369815, T387037)
[05:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:47] T369815: Enable in content Translation the new languages Google Translate supports in June 2024 - https://phabricator.wikimedia.org/T369815
[05:53:48] T387037: Set Google machine translation option as default in the Content Translation tool on Kannada Wikipedia - https://phabricator.wikimedia.org/T387037
[06:09:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1221 db2147', diff saved to https://phabricator.wikimedia.org/P74014 and previous config saved to /var/cache/conftool/dbconfig/20250304-060927-marostegui.json
[06:10:16]
!log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebuilding index
[06:10:29] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2147.codfw.wmnet
[06:10:40] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1221.eqiad.wmnet
[06:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2167 db1211', diff saved to https://phabricator.wikimedia.org/P74015 and previous config saved to /var/cache/conftool/dbconfig/20250304-061152-marostegui.json
[06:12:10] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2167.codfw.wmnet
[06:12:21] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1211.eqiad.wmnet
[06:13:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Rebuilding index
[06:16:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1221.eqiad.wmnet
[06:17:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2147.codfw.wmnet
[06:17:29] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Index rebuild
[06:17:47] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Index rebuild
[06:17:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1211.eqiad.wmnet
[06:17:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2167.codfw.wmnet
[06:18:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1211.eqiad.wmnet with reason: Index rebuild
[06:18:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for
12:00:00 on db2167.codfw.wmnet with reason: Index rebuild
[06:27:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1163 with weight 0 T387552', diff saved to https://phabricator.wikimedia.org/P74016 and previous config saved to /var/cache/conftool/dbconfig/20250304-062702-marostegui.json
[06:27:05] T387552: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T387552
[06:27:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1163 from API/vslow/dump T387552', diff saved to https://phabricator.wikimedia.org/P74017 and previous config saved to /var/cache/conftool/dbconfig/20250304-062717-marostegui.json
[06:27:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s1 T387552
[06:27:50] (CR) Marostegui: [C:+2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1123594 (https://phabricator.wikimedia.org/T387552) (owner: Gerrit maintenance bot)
[06:31:19] (PS1) Marostegui: db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124331
[06:31:46] (CR) Marostegui: [C:+2] db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124331 (owner: Marostegui)
[06:32:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1163 to s1 primary T387552', diff saved to https://phabricator.wikimedia.org/P74018 and previous config saved to /var/cache/conftool/dbconfig/20250304-063222-marostegui.json
[06:32:30] T387552: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T387552
[06:33:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1184 T387552', diff saved to https://phabricator.wikimedia.org/P74019 and previous config saved to /var/cache/conftool/dbconfig/20250304-063320-marostegui.json
[06:35:20] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1184.eqiad.wmnet
[06:41:56] !log marostegui@cumin1002 END
(PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1184.eqiad.wmnet
[06:42:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Index rebuild
[06:42:32] (PS1) Marostegui: Revert "db1184: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1124332
[06:45:09] (PS1) Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1124335 (https://phabricator.wikimedia.org/T387816)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0700)
[07:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0700).
[07:11:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:17:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114955 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[07:18:24] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387817 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:18:32] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387817 (ops-monitoring-bot) NEW
[07:29:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[07:29:59] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm -
https://phabricator.wikimedia.org/T382507#10599423 (ops-monitoring-bot) Draining ganeti1031.eqiad.wmnet of running VMs
[07:32:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet
[07:34:44] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10599434 (MoritzMuehlenhoff)
[07:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3477 MB (3% inode=98%): /tmp 3477 MB (3% inode=98%): /var/tmp 3477 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[07:35:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to drbd
[07:36:02] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10599435 (ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to drbd
[07:39:03] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:39:27] (CR) Marostegui: [C:+2] Revert "db1184: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1124332 (owner: Marostegui)
[07:41:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:45:32] (PS1) Sergio Gimeno: analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286)
[07:47:30] SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10599476 (fgiunchedi) Also note that standard partman recipes nowadays don't allocate all space on LVM for emergency response purposes, e.g. ` deploy2002:~$...
[07:47:39] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#10599478 (MoritzMuehlenhoff) Open→Resolved Closing out this old task as part of clinic duty, these VMs are long in production.
[07:50:29] (PS1) Muehlenhoff: Switch ganeti1031 to nftables [puppet] - https://gerrit.wikimedia.org/r/1124361
[07:51:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to drbd
[07:51:21] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:51:37] (PS1) Sergio Gimeno: analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124362 (https://phabricator.wikimedia.org/T387286)
[07:51:47] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[07:52:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286) (owner: Sergio Gimeno)
[07:52:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC morning backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124362 (https://phabricator.wikimedia.org/T387286) (owner: Sergio Gimeno)
[07:52:24] FIRING: [2x] ProbeDown: Service kubestagemaster1005:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:53:26] ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10599523 (fgiunchedi) >>! In T387597#10598127, @Dzahn wrote: > Or maybe I'm wrong and it doesn't create a single task anymore if it happens for different hosts. The bot will update the existing open...
[07:53:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[07:53:57] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10599526 (ops-monitoring-bot) Draining ganeti1031.eqiad.wmnet of running VMs
[07:54:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet
[07:54:35] RESOLVED: [2x] ProbeDown: Service kubestagemaster1005:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:57:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[07:57:37] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to
Bookworm - https://phabricator.wikimedia.org/T382507#10599531 (ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to plain
[07:58:35] (CR) Sergio Gimeno: [C:+2] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286) (owner: Sergio Gimeno)
[07:59:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1005.eqiad.wmnet to plain
[07:59:17] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:59:47] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms
[08:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0800).
[08:00:05] sergi0 and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [08:00:22] o/ [08:00:22] o/ [08:00:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10599563 (10ops-monitoring-bot) Draining ganeti1031.eqiad.wmnet of running VMs [08:00:45] (03CR) 10Brouberol: [C:03+2] airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) (owner: 10Brouberol) [08:02:26] (03PS2) 10Sergio Gimeno: analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286) [08:03:16] !log hashar@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.19 refs T386214 [08:03:19] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [08:04:11] sergi0: dcausse: hi, I am fixing up the train right now [08:04:22] I forgot about the backport window [08:04:26] hashar: np! [08:04:35] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:41] (03PS2) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [08:04:50] so scap is running currently being busy preparing for the mw train. 
That should take roughly ten minutes [08:05:12] ack [08:07:05] (03CR) 10CI reject: [V:04-1] systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [08:07:31] (03PS3) 10Slyngshede: Upgrade to CAS 7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 [08:08:13] oh no [08:08:18] I think it created a large image [08:08:24] so that would end up taking an hour or so to deploy that [08:08:26] !log hashar@deploy2002 sync-world aborted: testwikis to 1.44.0-wmf.19 refs T386214 (duration: 05m 10s) [08:08:29] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [08:08:58] dcausse: sergi0 go ahead, I will do the deployment later :) [08:09:25] hashar: ok :) [08:09:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:09:31] I forgot that the image to push would come with a brand new mw version and thus requires pushing a 9G image [08:10:48] I can deploy I guess [08:10:48] dcausse: are you self-deploying? 
I'm still waiting for CI for my non-config changes, so you can go ahead first if you want [08:10:59] sergi0: sure [08:11:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114955 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:12:31] (03Merged) 10jenkins-bot: cirrus: add v1 stream for the search update pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114955 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:12:55] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs5006 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [08:13:28] (03PS3) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [08:13:29] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1114955|cirrus: add v1 stream for the search update pipeline (T375821)]] [08:13:32] T375821: Migrate streaming updater event schema to the standard schema repository - https://phabricator.wikimedia.org/T375821 [08:16:05] (03CR) 10CI reject: [V:04-1] systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [08:17:19] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs5006.eqsin.wmnet with OS bookworm [08:18:25] (03PS5) 10Abijeet Patro: metawiki: Enable Chinese variant translation for message bundles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) [08:18:28] (03CR) 10Abijeet Patro: metawiki: Enable Chinese variant translation for message bundles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro) [08:18:42] (03CR) 10Sergio Gimeno: "resubmit" 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [08:19:03] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:20:34] ^^ that's lvs5006 being reimaged [08:21:16] (03Merged) 10jenkins-bot: analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124360 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [08:22:08] (03PS1) 10Filippo Giunchedi: pyrra: limit search-update-lag to the correct site [puppet] - 10https://gerrit.wikimedia.org/r/1124365 (https://phabricator.wikimedia.org/T387827) [08:22:33] (03PS4) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [08:24:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Rebuilding index [08:24:35] (03CR) 10Nikerabbit: [C:03+1] metawiki: Enable Chinese variant translation for message bundles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro) [08:24:37] hashar: "Finished build-and-push-container-images (duration: 09m 01s)" perhaps I pushed the image you mentioned? 
[08:28:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74020 and previous config saved to /var/cache/conftool/dbconfig/20250304-082819-root.json [08:29:16] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1114955|cirrus: add v1 stream for the search update pipeline (T375821)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:29:18] T375821: Migrate streaming updater event schema to the standard schema repository - https://phabricator.wikimedia.org/T375821 [08:30:56] hashar: seems like I'm running the train, test.wikipedia.org is 1.44.0-wmf.18 but 1.44.0-wmf.19 via debug servers [08:32:55] unsure if it's safe to proceed, never deployed the train before and I'm not clear what are the things you need to check [08:34:33] (03PS5) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [08:35:00] dcausse: ahhhhhh [08:35:01] if there's a train driver around I'd love some help :) [08:35:31] hashar: good to see you around :P [08:36:05] ahh cause the overnight presync failed [08:36:17] but the testwikis are still bumped to the new wmf.19 [08:36:41] or given there is a newer image on the deployment server it is pushed regardless [08:37:41] that mess happens whenever the overnight job fails [08:37:47] I am going to look in logstash for the scap progress [08:38:48] hashar: it's currently at the test step, haven't pressed 'y' yet, should I? 
[08:38:56] yes [08:38:59] ok [08:39:01] hmm [08:39:08] I mean that is to test your patch I guess [08:39:14] yes [08:39:24] I think it deployed both the train and my patch [08:39:28] so you can use X wikimedia debug to test your backport [08:39:30] else press Y :) [08:39:40] and yess that will promote the tests wikis to wmf.19 [08:39:57] which is ok, we do that automatically over night [08:39:58] hashar: my backport was tested [08:40:11] then accept to continue [08:40:14] Y + Return [08:40:16] ok [08:40:20] !log dcausse@deploy2002 dcausse: Continuing with sync [08:40:30] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [08:40:33] that would push to the canaries and prod [08:40:41] as usualy [08:40:42] usual [08:43:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74021 and previous config saved to /var/cache/conftool/dbconfig/20250304-084325-root.json [08:44:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [08:44:39] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:45:08] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:45:19] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10599692 (10JMeybohm) 05Resolved→03Open >>! In T288629#10598113, @dancy wrote: >>>! In T288629#10582102, @JMeybohm wrote:... 
[08:45:23] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:45:49] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10599694 (10JMeybohm) a:05dancy→03None [08:48:23] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829 (10phaultfinder) 03NEW [08:50:30] dcausse: s yeah essentially you end up pushing everything :/ [08:50:43] I should have run the command immediately earlier this morning [08:50:55] or we can raise the timeout [08:51:29] hashar: np! glad that you were around [08:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74022 and previous config saved to /var/cache/conftool/dbconfig/20250304-085207-root.json [08:52:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [08:54:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:54:47] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114955|cirrus: add v1 stream for the search update pipeline (T375821)]] (duration: 41m 17s) [08:54:50] T375821: Migrate streaming updater event schema to the standard schema repository - https://phabricator.wikimedia.org/T375821 [08:55:34] !log restarting eventgate-main to pickup to new streams (T375821) [08:55:37] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:36] dcausse: let me know when it's safe for me to start [08:56:39] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [08:56:58] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [08:57:11] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [08:57:25] sergi0: I'm done, sorry it took so long, hashar is it OK to continue the backport window? [08:57:37] yes please do [08:57:39] it is fine to extend [08:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:57:48] sergi0: please go ahead [08:57:49] I will run the group0 promotion when you are done [08:57:51] perfect, ty! [08:57:56] sorry for the delay :/ [08:58:02] np! 
[08:58:02] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.eqiad.wmnet [08:58:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74023 and previous config saved to /var/cache/conftool/dbconfig/20250304-085829-root.json [08:58:32] (03PS4) 10Slyngshede: Upgrade to CAS 7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 [08:59:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [08:59:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:59:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:59:45] (03Merged) 10jenkins-bot: [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [08:59:47] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [09:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T0900) [09:00:06] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [09:00:19] (03CR) 10JMeybohm: [C:03+2] Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:00:56] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]] [09:00:59] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [09:01:03] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:01:31] (03Merged) 10jenkins-bot: Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:02:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:22] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:07:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74024 and previous config saved to /var/cache/conftool/dbconfig/20250304-090712-root.json [09:08:33] !log sgimeno@deploy2002 sgimeno: Continuing with sync [09:09:33] (03PS1) 10Vgutierrez: hiera: Fix NIC name for liberica instances in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1124371 (https://phabricator.wikimedia.org/T384477) [09:12:24] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74025 and previous config saved to /var/cache/conftool/dbconfig/20250304-091334-root.json [09:13:40] (03CR) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:16:34] I am taking a break, I'll run it at 10am UTC (44 minutes from now) [09:16:57] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]] (duration: 16m 01s) [09:17:00] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [09:17:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124362 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [09:19:13] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:20:09] (03Merged) 10jenkins-bot: analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124362 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [09:20:38] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]] [09:21:01] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:21:59] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:22:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74026 and previous config saved to /var/cache/conftool/dbconfig/20250304-092217-root.json [09:22:24] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:23:17] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [09:23:19] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:23:28] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:23:31] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [09:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:24:42] (03CR) 10DCausse: [C:03+1] wdqs-categories: use new split graph hosts (wdqs-main) for categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: 10Bking) [09:25:24] !log sgimeno@deploy2002 sgimeno: Continuing with sync [09:26:24] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix NIC name for liberica instances in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1124371 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:26:26] (03PS3) 10DCausse: cirrus: drop cirrus_saneitize_jobs periodic job (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113461 [09:27:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Rebuilding index [09:28:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74027 and previous config saved to /var/cache/conftool/dbconfig/20250304-092839-root.json [09:28:45] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:32:39] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]] (duration: 12m 01s) [09:32:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5006.eqsin.wmnet with OS bookworm [09:32:42] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [09:32:47] (03PS19) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [09:33:03] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:33:11] hashar: backport window finished [09:33:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:34:32] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:34:52] (03CR) 10Fabfur: [C:03+1] hiera: Fix NIC name for liberica instances in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1124371 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:37:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74028 and previous config saved to /var/cache/conftool/dbconfig/20250304-093723-root.json [09:39:24] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:39:28] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [09:41:00] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:43:06] 06SRE, 06SRE Observability: Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754#10599971 (10LSobanski) [09:48:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10599978 (10Peachey88) [09:48:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387817#10599981 (10Peachey88) →14Duplicate dup:03T382984 [09:51:13] 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10599984 (10aborrero) Hey @VRiley-WMF and/or @Jclark-ctr, Could you please check on-site if there is a loose cab... 
[09:51:43] 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10599987 (10aborrero) a:03VRiley-WMF [09:52:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74029 and previous config saved to /var/cache/conftool/dbconfig/20250304-095228-root.json [09:52:49] !log aborrero@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcontrol1005.eqiad.wmnet [10:00:30] !log wdqs: reconciled Q27151108 on both eqiad & codfw wdqs endpoints (T386998) [10:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:33] T386998: Items missing from the Wikidata SPARQL query result - https://phabricator.wikimedia.org/T386998 [10:04:59] (03PS1) 10Slyngshede: Upgrade idp-test to 7.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1124376 [10:09:20] with all of that I did not even look at the logs :) [10:10:44] I am doing the train now [10:11:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74030 and previous config saved to /var/cache/conftool/dbconfig/20250304-101124-root.json [10:11:38] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124378 (https://phabricator.wikimedia.org/T386214) [10:11:39] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124378 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [10:12:06] (03CR) 10Elukey: sre.mysql.pool: sanity check for depool operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:12:30] (03Merged) 
10jenkins-bot: group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124378 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [10:13:05] (03PS1) 10Federico Ceratto: db1253.yaml: prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) [10:13:58] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs5005 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124380 (https://phabricator.wikimedia.org/T384477) [10:14:33] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124380 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:19:05] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1124380 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:19:28] !log depooling lvs5005 before reimaging - T384477 [10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:31] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [10:20:45] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs5005.eqsin.wmnet with reason: depooled before reimage [10:21:28] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs5005 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124380 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:21:52] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.19 refs T386214 [10:21:55] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [10:22:03] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:22:20] ^^ that's lvs5005 being depooled (BGP alert) [10:22:32] (03CR) 10Marostegui: [C:04-1] "You also need to add it to site.pp and instances.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [10:26:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74031 and previous config saved to /var/cache/conftool/dbconfig/20250304-102630-root.json [10:27:12] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS bookworm [10:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:34:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3454 MB (3% inode=98%): /tmp 3454 MB (3% inode=98%): /var/tmp 3454 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [10:35:31] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. 
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10600094 (10Peter) I've been looking into the data we get from Google/Chrome through the [[ https://developer.chrome.com/docs/cru... [10:35:39] !log T387789 Ran mwscript-k8s --comment="T387789" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'JamesVilla44' 'DartsF4' --ignorestatus [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:41] T387789: Unblock stuck global rename of JamesVilla44 - https://phabricator.wikimedia.org/T387789 [10:36:57] (03CR) 10Clément Goubert: [C:03+2] php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [10:37:12] (03CR) 10Clément Goubert: [C:03+2] php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [10:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:37:55] (03PS4) 10Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) [10:37:58] (03CR) 10Clément Goubert: [C:03+2] php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [10:38:27] (03CR) 10Clément Goubert: [V:03+2 C:03+2] php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [10:40:27] (03CR) 10Effie Mouzeli: [C:03+1] php8.1: Set display_startup_errors consistent with display_errors 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [10:41:01] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:41:01] (03CR) 10Elukey: "I left mostly nits, lemme know your thoughts but they are not blocking. In the future there are bits of the cookbook that could become par" [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:41:34] (03PS1) 10Cathal Mooney: Example Python modules for generating Nokia switch config [homer/public] - 10https://gerrit.wikimedia.org/r/1124382 (https://phabricator.wikimedia.org/T371088) [10:41:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74032 and previous config saved to /var/cache/conftool/dbconfig/20250304-104135-root.json [10:42:52] (03Abandoned) 10Cathal Mooney: Example Python modules for generating Nokia switch config [homer/public] - 10https://gerrit.wikimedia.org/r/1124382 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:43:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74033 and previous config saved to /var/cache/conftool/dbconfig/20250304-104336-root.json [10:43:44] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:44:27] (03PS19) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:46:49] Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'gjlw_namespace' in 'where clause' [10:47:08] which comes from a job that ran on mediawikiwiki . I have filed it at https://phabricator.wikimedia.org/T387843 [10:47:18] (03PS1) 10Clément Goubert: deployment_server: Prune docker images every week [puppet] - 10https://gerrit.wikimedia.org/r/1124383 (https://phabricator.wikimedia.org/T372602) [10:47:19] hashar: checking [10:47:22] that is from jsonconfig [10:47:24] marostegui: thanks ! :) [10:48:17] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [10:48:17] (03PS1) 10Mhorsey: Release "separate ongoing events" feature for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124384 (https://phabricator.wikimedia.org/T386427) [10:48:26] hashar: https://phabricator.wikimedia.org/T387843#10600134 [10:48:29] Can that be reverted? 
[10:49:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124384 (https://phabricator.wikimedia.org/T386427) (owner: 10Mhorsey) [10:49:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Rebuilding index [10:50:52] marostegui: I don't know which code triggers it [10:50:52] (03PS6) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [10:51:02] I guess the easiest is rolling back the train on it [10:51:32] (03PS9) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [10:51:38] (03CR) 10CI reject: [V:04-1] WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:51:45] (03PS2) 10Ayounsi: Add GraphQL queries to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 [10:52:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [10:52:31] (03CR) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files (034 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [10:52:39] oh [10:52:41] there is a feature flag [10:52:43] (03Abandoned) 10Mhorsey: Release "separate ongoing events" feature for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124384 (https://phabricator.wikimedia.org/T386427) (owner: 10Mhorsey) [10:55:22] marostegui: I am rolling back [10:56:21] (03CR) 10CI reject: [V:04-1] 
Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [10:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74034 and previous config saved to /var/cache/conftool/dbconfig/20250304-105640-root.json [10:57:04] (03CR) 10Muehlenhoff: [C:03+1] "Installation on idp-test1004 looks good to the extent testable not serving idp-test.w.p" [dns] - 10https://gerrit.wikimedia.org/r/1124376 (owner: 10Slyngshede) [10:58:09] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124386 (https://phabricator.wikimedia.org/T386214) [10:58:10] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124386 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [10:58:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2161 db1214', diff saved to https://phabricator.wikimedia.org/P74035 and previous config saved to /var/cache/conftool/dbconfig/20250304-105814-marostegui.json [10:58:31] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2161.codfw.wmnet [10:58:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74036 and previous config saved to /var/cache/conftool/dbconfig/20250304-105842-root.json [10:58:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1214.eqiad.wmnet [10:58:55] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124386 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [10:59:25] !log hashar@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.19 refs T386214 [10:59:27] T386214: 1.44.0-wmf.19 deployment 
blockers - https://phabricator.wikimedia.org/T386214 [10:59:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1100) [11:02:06] marostegui: I have rolled back the train, made T387843 an unbreak now, and I have poked the task you mentioned, T385917 [11:02:07] T387843: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'gjlw_namespace' in 'where clause' - https://phabricator.wikimedia.org/T387843 [11:02:07] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [11:02:15] thank you hashar [11:02:18] and given I don't know anything about that code, I will let devs figure it out [11:02:27] because I don't know whether applying the patch on other wikis was missed [11:02:33] !log joal@deploy2002 Started deploy [analytics/refinery@dbcd265]: Regular analytics weekly train [analytics/refinery@dbcd2652] [11:02:35] or whether the code is not even supposed to write to mediawikiwiki [11:02:38] or whatever really :) [11:02:46] I lack the context! :b [11:02:52] Yeah I don't know either, but they should've waited for us to deploy the schema change [11:02:55] thank you for the quick triage! 
[11:04:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1214.eqiad.wmnet [11:05:11] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13072MiB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [11:05:26] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Index rebuild [11:05:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2161.codfw.wmnet [11:05:31] !log joal@deploy2002 Finished deploy [analytics/refinery@dbcd265]: Regular analytics weekly train [analytics/refinery@dbcd2652] (duration: 02m 58s) [11:06:07] !log joal@deploy2002 Started deploy [analytics/refinery@dbcd265] (thin): Regular analytics weekly train THIN [analytics/refinery@dbcd2652] [11:06:08] (03PS1) 10Effie Mouzeli: shellbox-media: serve 1/8 of requests on 8.1 with more logging (2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) [11:06:40] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Index rebuild [11:07:02] !log joal@deploy2002 Finished deploy [analytics/refinery@dbcd265] (thin): Regular analytics weekly train THIN [analytics/refinery@dbcd2652] (duration: 00m 55s) [11:07:24] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] php8.1: Set display_startup_errors consistent with display_errors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:07:37] !log joal@deploy2002 Started deploy [analytics/refinery@dbcd265] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dbcd2652] [11:08:12] !log joal@deploy2002 Finished deploy 
[analytics/refinery@dbcd265] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dbcd2652] (duration: 00m 35s) [11:09:03] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Rebuilding index [11:09:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:34] !log hashar@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.19 refs T386214 (duration: 11m 09s) [11:10:37] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [11:11:44] train rolled back [11:11:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74037 and previous config saved to /var/cache/conftool/dbconfig/20250304-111146-root.json [11:11:47] I am off for lunch [11:11:56] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [11:12:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS bookworm [11:12:47] (03PS20) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [11:13:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74038 and previous config saved to 
/var/cache/conftool/dbconfig/20250304-111347-root.json [11:14:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3453 MB (3% inode=98%): /tmp 3453 MB (3% inode=98%): /var/tmp 3453 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [11:15:07] (03PS1) 10Vgutierrez: hiera: Restore lvs5005 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124389 (https://phabricator.wikimedia.org/T384477) [11:16:33] (03PS2) 10Vgutierrez: hiera: Restore lvs5005 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124389 (https://phabricator.wikimedia.org/T384477) [11:16:40] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [11:16:40] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10600193 (10aborrero) [11:17:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124389 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:20:22] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [11:20:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74039 and previous config saved to /var/cache/conftool/dbconfig/20250304-112035-root.json [11:20:59] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10600211 (10aborrero) 05In progress→03Open [11:22:04] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: 
apply [11:22:50] !log joal@deploy2002 Started deploy [airflow-dags/analytics@9a0b051]: Regular analytics weekly train [airflow-dags/analytics@9a0b0519] [11:23:26] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@9a0b051]: Regular analytics weekly train [airflow-dags/analytics@9a0b0519] (duration: 00m 35s) [11:23:30] (03PS1) 10David Caro: maintain-dbusers: change the primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/1124391 (https://phabricator.wikimedia.org/T387845) [11:23:33] (03CR) 10Fabfur: [C:03+1] hiera: Restore lvs5005 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124389 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:24:05] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1124391 (https://phabricator.wikimedia.org/T387845) (owner: 10David Caro) [11:24:27] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs5005 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124389 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:25:08] (03PS1) 10Volans: sre.k8s: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124393 [11:25:08] (03PS1) 10Volans: sre.ganeti: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 [11:25:08] (03PS1) 10Volans: sre.hosts: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 [11:25:09] (03PS1) 10Volans: sre.network: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 [11:25:10] (03PS1) 10Volans: sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 [11:25:12] (03PS1) 10Volans: sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 [11:25:16] (03PS1) 10Volans: sre.deploy: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124399 [11:25:50] (03CR) 10Federico Ceratto: 
sre.mysql.pool: sanity check for depool operations (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [11:25:58] (03CR) 10David Caro: [C:03+2] maintain-dbusers: change the primary to cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/1124391 (https://phabricator.wikimedia.org/T387845) (owner: 10David Caro) [11:27:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [11:28:03] !log repooling lvs5005 running liberica - T384477 [11:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:06] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [11:28:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74043 and previous config saved to /var/cache/conftool/dbconfig/20250304-112852-root.json [11:30:55] (03PS1) 10Clément Goubert: mw-debug: Add staging releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124392 [11:34:44] (03CR) 10Volans: sre.k8s: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124393 (owner: 10Volans) [11:35:21] (03PS3) 10Hnowlan: trafficserver: send PUTs to the write datacentre [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) [11:35:35] (03CR) 10Volans: sre.ganeti: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 (owner: 10Volans) [11:35:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 25%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P74044 and previous config saved to /var/cache/conftool/dbconfig/20250304-113541-root.json [11:36:37] (03CR) 10Volans: sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [11:37:12] (03PS4) 10Hnowlan: trafficserver: send PUTs to the write datacentre [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) [11:37:43] (03CR) 10Hnowlan: "Any thoughts on whether we should do PATCH/DELETE in the same manner?" [puppet] - 10https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [11:38:01] (03CR) 10Volans: sre.gitlab: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans) [11:39:04] (03CR) 10Volans: "In case you prefer a more interactive approach we can replace raises=True with confirm=True to wrap it with confirm_on_failure." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 (owner: 10Volans) [11:39:44] (03PS6) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [11:41:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:41:15] (03CR) 10Effie Mouzeli: [C:03+1] "thank you for taking care of this" [puppet] - 10https://gerrit.wikimedia.org/r/1124383 (https://phabricator.wikimedia.org/T372602) (owner: 10Clément Goubert) [11:43:02] !log jiji@deploy2002 Started scap sync-world: Deploy php 8.1.34-1-s3 image [11:43:38] !log Deleting obsolete puppet certs for eventstreams.discovery.wmnet and eventgate-analytics-external.discovery.wmnet [11:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P74045 and previous config saved to /var/cache/conftool/dbconfig/20250304-114358-root.json [11:45:30] (03CR) 10Clément Goubert: [C:03+2] deployment_server: Prune docker images every week [puppet] - 10https://gerrit.wikimedia.org/r/1124383 (https://phabricator.wikimedia.org/T372602) (owner: 10Clément Goubert) [11:48:21] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/765ce1789b8290f1498bbe51e91f2d8f4e523281b00f31faf750edc0229586ba/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [11:49:11] (03PS1) 10Muehlenhoff: Add bwang to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124402 (https://phabricator.wikimedia.org/T387614) [11:50:18] (03CR) 10Hnowlan: "Given that we'll need this increase in the next few days for the single DC testing that's happening parallel to the switchover, I think ma" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [11:50:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74046 and previous config saved to /var/cache/conftool/dbconfig/20250304-115047-root.json [11:52:02] (03PS1) 10Fabfur: cache,haproxy: create tmpfile configuration for tls [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) [11:53:48] RESOLVED: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:54:04] (03CR) 10Muehlenhoff: [C:03+2] Add bwang to 
analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124402 (https://phabricator.wikimedia.org/T387614) (owner: 10Muehlenhoff) [11:54:27] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10600299 (10Clement_Goubert) Some of that is because we're building double the images we used to with the 8.1 migration. As a mitigation, I have lowered the ag... [11:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10600300 (10phaultfinder) [11:54:43] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [11:54:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3451 MB (3% inode=98%): /tmp 3451 MB (3% inode=98%): /var/tmp 3451 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [11:55:55] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [11:55:57] (03CR) 10Vgutierrez: "inline question, looking good" [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [11:56:29] (03PS20) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [11:57:43] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10600326 (10Clement_Goubert) Cleaned up all images older than 7 days: ` cgoubert@deploy2002:~$ df -h /srv Filesystem Size Used Avail Use% Mounted on... 
[11:59:24] (03PS1) 10Vgutierrez: hiera,site: Reimage lvs5004 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) [11:59:26] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10600332 (10fnegri) 05Resolved→03Open The alert has not fired since Feb 27. The alert is based on `ipmi_temperature_state`, and that seems t... [11:59:56] (03PS2) 10Vgutierrez: hiera,site: Reimage lvs5004 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) [12:01:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10600344 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @bwang: Ack! I've just merged a patch to add you to the group, it will take 25 minutes until it's r... [12:01:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:01:42] (03CR) 10Fabfur: systemd: add path unit type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [12:02:10] jouncebot: now [12:02:10] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [12:02:36] (03CR) 10Hnowlan: [C:03+2] citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:03:28] (03PS3) 10Sergio Gimeno: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) [12:03:54] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10600352 
(10MoritzMuehlenhoff) 05Open→03Resolved [12:03:58] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10600353 (10MoritzMuehlenhoff) 05Open→03Resolved [12:04:38] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [12:05:44] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10600363 (10MoritzMuehlenhoff) @Seddon This needs your approval as the manager @thcipriani This needs your approval as the group approver [12:05:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74047 and previous config saved to /var/cache/conftool/dbconfig/20250304-120552-root.json [12:06:08] (03CR) 10Vgutierrez: systemd: add path unit type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [12:06:40] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10600366 (10Seddon) Approved! [12:07:11] jouncebot: next [12:07:11] In 0 hour(s) and 52 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1300) [12:07:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:07:56] (03CR) 10FNegri: [C:03+1] "Nice! I think this can be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [12:08:21] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [12:13:52] (03CR) 10Fabfur: [C:03+2] hiera,site: Reimage lvs5004 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:14:00] what? [12:14:35] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:14:59] (03PS8) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [12:14:59] (03PS17) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [12:15:43] (03CR) 10Fabfur: systemd: add path unit type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [12:16:06] (03CR) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:17:46] (03PS18) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [12:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74048 and previous config saved to /var/cache/conftool/dbconfig/20250304-122057-root.json [12:20:58] (03PS1) 10JMeybohm: 
validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) [12:21:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:23:31] !log jiji@deploy2002 Started scap sync-world: Deploy php 8.1.34-1-s3 image [12:24:48] (03PS2) 10JMeybohm: mw-debug: Add staging releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124392 (owner: 10Clément Goubert) [12:24:48] (03PS1) 10JMeybohm: Add pod-security.wmg.org labels to mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [12:27:06] !log jiji@deploy2002 Finished scap sync-world: Deploy php 8.1.34-1-s3 image (duration: 04m 59s) [12:29:32] 06SRE, 10Math, 06Traffic: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329#10600461 (10MSantos) [12:32:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Rebuilding index [12:33:42] (03PS1) 10Hnowlan: trafficserver: remove restbase from citoid request path everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1124418 (https://phabricator.wikimedia.org/T361576) [12:34:28] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:34:47] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:35:49] (03CR) 10MSantos: [C:03+1] pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [12:39:25] (03PS3) 10JMeybohm: mw-debug: Add staging releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124392 (owner: 10Clément Goubert) [12:39:25] (03PS2) 10JMeybohm: Add pod-security.wmg.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [12:41:15] (03Merged) 10jenkins-bot: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:44:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2006.wikimedia.org [12:48:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2006.wikimedia.org [12:51:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2005.wikimedia.org [12:51:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:53:30] (03PS2) 10Federico Ceratto: db1253.yaml: prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) [12:55:20] (03CR) 10JMeybohm: [C:03+2] mw-debug: Add staging releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124392 (owner: 10Clément Goubert) [12:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:21] jouncebot: next [12:58:21] In 0 
hour(s) and 1 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1300) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1300) [13:00:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2005.wikimedia.org [13:01:02] (03Merged) 10jenkins-bot: mw-debug: Add staging releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124392 (owner: 10Clément Goubert) [13:02:03] (03CR) 10Volans: "I think is a nice approach and we can go ahead with it. I've left very minor comments inline." [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [13:06:50] (03CR) 10Sergio Gimeno: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [13:08:21] (03PS1) 10Dreamy Jazz: Create temporary-account-viewer group [extensions/CheckUser] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124429 (https://phabricator.wikimedia.org/T387205) [13:08:34] jouncebot: nowandnext [13:08:35] For the next 0 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1300) [13:08:35] In 0 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1400) [13:11:47] Anyone mind if I deploy? 
[13:13:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124429 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [13:17:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124429 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [13:19:23] (03PS1) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [13:22:46] (03CR) 10Jelto: [V:03+1] "I tested this with the miscweb service and no more warnings about environments and values file:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [13:24:39] (03CR) 10Aklapper: [V:03+2 C:03+2] Replace DB transaction query with FeedQuery to estimate recent edits [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121746 (https://phabricator.wikimedia.org/T386704) (owner: 10Aklapper) [13:29:50] (03Merged) 10jenkins-bot: Create temporary-account-viewer group [extensions/CheckUser] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124429 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [13:30:23] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1124429|Create temporary-account-viewer group (T387205)]] [13:30:26] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [13:31:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable $wgCampaignEventsSeparateOngoingEvents by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: 10Daimona 
Eaytoy) [13:34:30] (03CR) 10Lucas Werkmeister (WMDE): "This is currently scheduled as the last change in the backport window, but I think it might make sense to move it up? That way, if it brea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [13:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3625 MB (3% inode=98%): /tmp 3625 MB (3% inode=98%): /var/tmp 3625 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:38:43] (03PS2) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [13:38:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1004.wikimedia.org [13:43:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Rebuilding index [13:44:18] (03PS3) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [13:45:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1004.wikimedia.org [13:46:17] My backport is being slow due to it containing an i18n change. I think it should be done before the start of the backport window. [13:46:24] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:46:50] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:46:53] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:48:01] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:49:46] Actually, the k8s deployment itself is going much slower than last time I deployed a backport with i18n changes. [13:52:24] (03PS7) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [13:53:24] It's looking like I will collide with the scheduled changes in the backport window. Apologies to those who have scheduled changes. [13:54:06] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1124429|Create temporary-account-viewer group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:54:08] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:54:09] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [13:56:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1003.wikimedia.org [13:58:14] Dreamy_Jazz: I think you should ping the deployers of the forthcoming window [13:58:54] (03PS8) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [13:59:43] (03CR) 10Fabfur: systemd: add path unit type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [13:59:52] itamarWMDE, Daimona, tgr, claime: For the above. My backport is being much slower than I had anticipated. Currently at the sync-prod-k8s stage of deployment. [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1400). Please do the needful. [14:00:05] itamarWMDE, Daimona, and claime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:27] o/ [14:00:32] I could deploy if needed [14:00:37] (once the current deploy is done, that is) [14:00:52] but wouldn’t mind someone else doing the window either ^^ [14:01:44] I don't mind doing the deployments [14:01:53] (03CR) 10Marostegui: [C:04-1] "You also need to remove it from the spare role (line 972)" [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [14:02:03] Considering I'm causing the hold up :D [14:02:09] :D [14:02:31] (03CR) 10Elukey: sre.mysql.pool: sanity check for depool operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [14:02:33] o/ [14:02:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1003.wikimedia.org [14:03:34] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp7005.*} or P{cp7009.*} or P{cp[7011-7014]*} or P{cp7016.*} and A:cp for 9.2.6-1wm2 [14:04:18] (03CR) 10Vgutierrez: [C:03+1] systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [14:09:19] (03CR) 10Effie Mouzeli: [C:03+1] mw-api-int: serve 25% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124210 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:10:25] itamarWMDE, tgr_, claime: Are you around for the window? [14:10:37] yeah [14:11:22] mine can go last btw as it can potentially break scap (even if that time it shouldn't) and necessitate a rollback [14:11:33] Sure. Thanks. [14:11:39] Unfortunately some of the k8s deployments have failed on my deployment. [14:12:30] which ones? 
[14:12:36] (03CR) 10Stevemunene: [C:03+1] hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [14:12:46] mw-web-next and mw-api-ext-next [14:12:54] Scap backport is now rolling back [14:13:12] effie: ^ possibly something wrong with the new image? [14:13:28] (03PS1) 10Klausman: hiera/partman: set up ML k8s staging workers for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) [14:13:37] the thing is I rolled it myself before [14:13:45] (03PS1) 10Ayounsi: Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [14:14:05] Dreamy_Jazz: let me have a look [14:14:07] (03CR) 10Stevemunene: [C:03+1] hiera,eventschemas: Enable IPIP on schema@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [14:14:23] (03PS10) 10Ayounsi: WIP: wmf-netbox use GraphQL for fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [14:14:41] Dreamy_Jazz: any useful messages? 
[14:14:48] Lots of output [14:14:50] Let me see [14:15:46] (03CR) 10Elukey: [C:03+1] aux-k8s-worker: deploy role to codfw workers [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [14:16:07] (03PS3) 10Ayounsi: Add GraphQL queries to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1122138 [14:16:09] (03CR) 10Elukey: [C:03+1] aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [14:16:11] (03CR) 10Volans: Move fetch_device_interfaces from wmf-netbox to core Homer (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [14:16:23] (03CR) 10Ssingh: "Nice work, quick question in-line:" [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [14:16:46] WARNING: Top-level config key environments must be defined before releases in helmfile.yaml skipping missing values file matching ... 
[14:16:48] !log depooling lvs5004 before reimaging - T384477 [14:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:51] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [14:17:13] Dreamy_Jazz: that can be ignored, artifact of an upgrade to helmfile [14:17:14] Also talks about the configuration file being group readable [14:17:21] that can also be ignore [14:17:23] d [14:17:26] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs5004.eqsin.wmnet with reason: depooled before reimage [14:17:27] (03CR) 10Elukey: [C:03+1] sre.k8s: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124393 (owner: 10Volans) [14:17:44] The only other thing I can see is "UPGRADE FAILED: release next failed, and has been rolled back due to atomic being set: timed out waiting for condition" [14:17:55] (03CR) 10Elukey: [C:03+1] sre.ganeti: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 (owner: 10Volans) [14:18:03] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:09] Scap has now exited [14:18:22] Dreamy_Jazz: let me check the releases [14:18:41] It does say "Rollback completed" in the output [14:18:44] Dreamy_Jazz: ok so that tells us it couldn't deploy the release in time, thanks [14:18:46] (03CR) 10Vgutierrez: [C:03+2] hiera,site: Reimage lvs5004 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124407 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:19:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 (owner: 10Volans) [14:19:08] (03CR) 10Elukey: [C:03+1] sre.hosts: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [14:19:09] claime: I think I will do a sync 
world [14:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:19:19] (03CR) 10Elukey: [C:03+1] sre.network: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 (owner: 10Volans) [14:19:20] ack [14:19:50] (03CR) 10CI reject: [V:04-1] Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [14:20:01] Dreamy_Jazz: let me do a sync world, it should be ok [14:20:08] Sure. [14:20:13] !log installing libtasn1-6 security updates [14:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:27] (03CR) 10Elukey: [C:03+1] sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 (owner: 10Volans) [14:20:48] (03CR) 10Elukey: [C:03+1] sre.deploy: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124399 (owner: 10Volans) [14:21:14] Given it's been rolled back, I won't attempt to deploy my backport again. I can still run the window. [14:21:47] Dreamy_Jazz: actually, can you please run it one more time? [14:21:48] I'll create a revert for the change in wmf.19 and merge it [14:21:52] Okay, sure. [14:22:21] i.e. 
`scap backport 1124429` [14:22:23] (03PS3) 10Jgiannelos: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) [14:22:45] (03CR) 10Jgiannelos: "Updated the patch after feedback from the apps team" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [14:22:47] Running now [14:23:03] Dreamy_Jazz: your change made it to the main releases [14:23:07] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1124429|Create temporary-account-viewer group (T387205)]] [14:23:10] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [14:23:23] the php8.1 releases didn't upgrade on time [14:23:46] Yeah. It was on the last leg of deployment, so it seemed to just be that. [14:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:24:31] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: eventschemas::service@codfw [14:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10600887 (10phaultfinder) [14:24:37] (03CR) 10Vgutierrez: [C:03+2] hiera,eventschemas: Enable IPIP on schema@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [14:24:40] (03CR) 10Stevemunene: [C:03+1] hiera,analytics_cluster: Enable IPIP on datahubsearch@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124064 
(https://phabricator.wikimedia.org/T387306) (owner: 10Vgutierrez) [14:24:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:04] (03CR) 10Herron: [C:03+2] "good catch thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1124365 (https://phabricator.wikimedia.org/T387827) (owner: 10Filippo Giunchedi) [14:25:15] (03CR) 10Elukey: [C:03+1] "Left a nit to fix but the rest LGTM, you can go after the fix." [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [14:25:22] Dreamy_Jazz: how is it looking? [14:25:34] Currently in sync-testservers-k8s [14:25:38] All green at the moment [14:25:50] pybal@eqsin alert it's me [14:26:42] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124441 [14:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [14:27:26] The deployment is being as slow as before btw [14:27:39] (03CR) 10Stevemunene: [C:03+1] hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [14:28:13] (03PS2) 10Klausman: hiera/partman: set up ML k8s staging workers for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) [14:28:43] Now on sync-testservers [14:28:56] (03CR) 10Klausman: hiera/partman: set up ML k8s staging workers for containerd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [14:29:11] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS bookworm 
[14:29:16] (03PS3) 10Klausman: hiera/partman: set up ML k8s staging workers for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) [14:29:31] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [14:29:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:29:38] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [14:29:50] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1124429|Create temporary-account-viewer group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:30:00] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:30:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:30:10] (03PS1) 10Effie Mouzeli: Revert "php8.1: Set display_startup_errors consistent with display_errors" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124444 [14:30:25] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [14:30:38] Deployment has failed again [14:30:44] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:30:46] effie: [14:30:58] Error is different this time [14:31:02] !log vgutierrez@cumin1002 END (PASS) - Cookbook 
sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:31:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: eventschemas::service@codfw [14:31:04] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1031.eqiad.wmnet with reason: remove from cluster for reimage [14:31:07] Talks about connection issues [14:31:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10600932 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=49bd4e46-521c-46ca-9334-5c777206e882) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [14:31:11] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:31:36] Dreamy_Jazz: something slightly more helpful? [14:31:37] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123672 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [14:32:13] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [14:32:22] Dreamy_Jazz: mind if I hijack your tmux a minute? [14:32:35] Sure. 
[14:32:41] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [14:32:54] (03CR) 10Ayounsi: Move fetch_device_interfaces from wmf-netbox to core Homer (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [14:32:56] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5024/co" [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [14:33:21] I did try to copy the text, but it appeared to just cause me to exit the command. [14:33:42] That was after the errors were displayed [14:33:44] Dreamy_Jazz: print screen :) [14:33:45] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123676 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [14:34:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS bookworm [14:34:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10600934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1031.eqiad.wmnet with OS bookworm [14:34:43] 21m Warning FailedMount pod/mw-web.codfw.next-d6cd796c6-srx8s MountVolume.SetUp failed for volume "mediawiki-next-httpd-early" : failed to sync configmap cache: timed out waiting for the condition < that's not great, that's the connection to the apiservers failing [14:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3510 MB (3% inode=98%): /tmp 3510 MB (3% inode=98%): /var/tmp 3510 MB (3% inode=98%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:35:27] not that [14:35:31] wrong copy [14:36:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10600939 (10MoritzMuehlenhoff) a:05BTullis→03MoritzMuehlenhoff [14:36:11] I'll revert effie's change see if it's better, but I also need to take a look at the apiservers first [14:36:41] (03PS1) 10Jgiannelos: pcs: Enable cache jitter to avoid mass invalidation of objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124447 (https://phabricator.wikimedia.org/T387472) [14:36:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [14:37:38] Hmm. It's looking like for sure not all changes in the window can be done. 
[14:37:57] (03PS2) 10Fabfur: cache,haproxy: create tmpfile configuration for tls [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) [14:38:42] (03CR) 10Klausman: [V:03+2 C:03+2] hiera/partman: set up ML k8s staging workers for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1124436 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [14:39:21] (03CR) 10MVernon: [C:03+1] cassandra: obsolete secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1123703 (https://phabricator.wikimedia.org/T387586) (owner: 10Eevans) [14:39:42] FIRING: [2x] JobUnavailable: Reduced availability for job pybal in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:47] (03PS2) 10Jgiannelos: pcs: Enable cache jitter to avoid mass invalidation of objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124447 (https://phabricator.wikimedia.org/T387472) [14:40:43] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: eventschemas::service@eqiad [14:40:49] (03CR) 10MVernon: [C:03+1] sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 (owner: 10Volans) [14:40:54] (03CR) 10Vgutierrez: [C:03+2] hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [14:40:59] (03PS2) 10Vgutierrez: hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) [14:41:01] (03CR) 10Herron: [C:03+2] aux-k8s-worker: deploy role to codfw workers [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [14:41:09] ok I'm not finding anything in k8s apiserver 
metrics, effie swfrench-wmf [14:41:11] (03PS8) 10Herron: aux-k8s-worker: deploy role to codfw workers [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) [14:41:16] will revert the change [14:41:31] claime: then let's proceed with reverting the change, and dig afterwards [14:41:37] yeah [14:41:54] there is absolutely no reason this change would do that =/ [14:42:14] swfrench-wmf: I can't find any either [14:42:25] but rn I'm out of ideas so might as well exclude it [14:42:55] can you +1 the revert? [14:43:10] (03CR) 10Vgutierrez: cache,haproxy: create tmpfile configuration for tls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [14:43:15] looking [14:43:24] (03CR) 10Herron: [C:03+2] aux-k8s-worker: deploy role to codfw workers [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [14:43:50] (03CR) 10Vgutierrez: [C:03+2] hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [14:43:59] claime: is this revert actually going to do anything? it's a literal revert [14:44:13] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10600997 (10tappof) @Papaul No, I'm working on setting these objects only in Prometheus. 
[14:44:13] oh god, yeah no [14:44:14] it needs to revert the changes and then advance the tag [14:44:18] yeah [14:44:32] here, let me fix it - gimme a minute [14:44:46] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [14:46:03] !log restarting r/w slapds to pick up libtasn updates [14:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:04] (03CR) 10Fabfur: [C:03+2] systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [14:47:55] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123678 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [14:48:28] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [14:49:03] (03PS2) 10Scott French: Revert "php8.1: Set display_startup_errors consistent with display_errors" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124444 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [14:49:38] claime: fixed - PTAL [14:49:54] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123688 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [14:49:59] (03PS1) 10Michael Große: fix(surfacing): don't show highlights on protected pages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124449 [14:50:09] (03CR) 10Clément Goubert: [C:03+1] Revert "php8.1: Set display_startup_errors consistent with display_errors" [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1124444 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [14:50:11] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [14:50:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124449 (owner: 10Michael Große) [14:50:32] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [14:50:45] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bookworm [14:50:48] (03PS1) 10Michael Große: fix(surfacing): don't show highlights on protected pages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124451 [14:50:52] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [14:51:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124451 (owner: 10Michael Große) [14:51:04] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-staging2002 [14:51:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-staging2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:49] (03PS3) 10Federico Ceratto: 
db1253.yaml: prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) [14:51:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [14:51:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: eventschemas::service@eqiad [14:52:01] ^^ known, I did an overly-broad puppet change. Fortunately, it's just staging. [14:52:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10601026 (10Jclark-ctr) Thanks for ordering drive @Papaul It had not been replaced yet i just located it in new cage near disk degausser will replace drive and work with dell on getting 2 more replacements [14:52:28] (03CR) 10Clément Goubert: [C:03+2] Revert "php8.1: Set display_startup_errors consistent with display_errors" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124444 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [14:52:30] (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert "php8.1: Set display_startup_errors consistent with display_errors" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124444 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [14:53:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [14:53:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [14:53:42] (03CR) 10Bking: [C:03+2] elastic: bring repurposed hosts back into elastic clusters [puppet] - 10https://gerrit.wikimedia.org/r/1124203 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [14:54:17] !log klausman@cumin2002 START - Cookbook 
sre.dns.netbox [14:56:08] !log cgoubert@deploy2002 Started scap sync-world: Deploying [[gerrit:1124444|Revert "php8.1: Set display_startup_errors consistent with display_errors" [14:57:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [14:57:35] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp7005.*} or P{cp7009.*} or P{cp[7011-7014]*} or P{cp7016.*} and A:cp for 9.2.6-1wm2 [14:57:40] FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-worker2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker2005 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:58:20] (03CR) 10Marostegui: [C:03+1] db1253.yaml: prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [14:58:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387868 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:58:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387868 (10ops-monitoring-bot) 03NEW [14:58:47] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-staging2002 - klausman@cumin2002" [14:58:52] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-staging2002 - klausman@cumin2002" [14:58:53] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[14:58:53] !log klausman@cumin2002 START - Cookbook sre.dns.wipe-cache ml-staging2002.codfw.wmnet 174.48.192.10.in-addr.arpa 4.7.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:58:56] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-staging2002.codfw.wmnet 174.48.192.10.in-addr.arpa 4.7.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:58:57] !log klausman@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-staging2002 [14:59:13] !log klausman@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-staging2002 [14:59:13] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-staging2002 [15:01:21] (03CR) 10Btullis: [C:03+1] "Looks good,thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) (owner: 10Vgutierrez) [15:02:14] (03PS1) 10Herron: KubernetesRsyslogDown: bump threshold to 15m [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) [15:02:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:02:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10601065 (10Jclark-ctr) a:03Jclark-ctr opened new service request 206452108 [15:03:50] (03CR) 10CI reject: [V:04-1] KubernetesRsyslogDown: bump threshold to 15m [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:04:42] FIRING: [2x] JobUnavailable: Reduced availability for job pybal in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:20] 06SRE, 06SRE Observability: Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754#10601077 (10fgiunchedi) I have disabled the handler for the raid check here as a bandaid: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-presto10... [15:07:43] (03PS4) 10Sergio Gimeno: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) [15:08:24] (03PS3) 10Fabfur: cache,haproxy: create tmpfile configuration for tls [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) [15:09:04] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1013 - https://phabricator.wikimedia.org/T387252#10601083 (10Jclark-ctr) error is no longer present jclark@backup1013:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sdb1[0] sdc... [15:09:14] (03CR) 10Fabfur: cache,haproxy: create tmpfile configuration for tls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [15:09:50] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1013 - https://phabricator.wikimedia.org/T387252#10601084 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:13:04] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:35] Will the deployment of the revert also deploy the CheckUser backport I was attempting to backport? [15:15:18] (03CR) 10MVernon: "> I think this should be the last step once the stable-to-bullseye backports are available in apt-staging. 
Other than my actually going an" [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [15:15:20] (03CR) 10Vgutierrez: cache,haproxy: create tmpfile configuration for tls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [15:15:58] I'm assuming it is because the sync-world command AFAIK applies to all changes, but didn't see it in the logmsgbot message [15:16:24] (03PS2) 10Daimona Eaytoy: Enable $wgCampaignEventsSeparateOngoingEvents by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) [15:16:40] (03PS3) 10Daimona Eaytoy: Enable $wgCampaignEventsSeparateOngoingEvents by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) [15:16:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1031.eqiad.wmnet with OS bookworm [15:16:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10601137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1031.eqiad.wmnet with OS bookworm completed: - ganeti103... 
[15:16:53] (03CR) 10Daimona Eaytoy: Enable $wgCampaignEventsSeparateOngoingEvents by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: 10Daimona Eaytoy) [15:16:54] (03CR) 10Hnowlan: [C:03+1] pcs: Enable cache jitter to avoid mass invalidation of objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124447 (https://phabricator.wikimedia.org/T387472) (owner: 10Jgiannelos) [15:17:13] Dreamy_Jazz: Everything merged gets deployed [15:17:26] (03CR) 10Jgiannelos: [C:03+2] pcs: Enable cache jitter to avoid mass invalidation of objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124447 (https://phabricator.wikimedia.org/T387472) (owner: 10Jgiannelos) [15:17:26] Dreamy_Jazz: yeah [15:17:41] (03CR) 10Hnowlan: [C:03+1] pcs: Enable more wikis for native PCS pregeneration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [15:17:42] (during a sync-world, which is the last step of a scap backport) [15:17:43] The logmsgbot message was manually and badly typed by me [15:17:53] Thanks [15:18:01] Apologies for the disruption this has caused. [15:18:10] not your fault lol [15:18:30] Fair. Wasn't sure if the i18n change was the straw on the camel's back kind of thing [15:18:33] Thanks [15:18:37] I don't think so [15:19:26] (03Merged) 10jenkins-bot: pcs: Enable cache jitter to avoid mass invalidation of objects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124447 (https://phabricator.wikimedia.org/T387472) (owner: 10Jgiannelos) [15:19:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10601145 (10Jhancock.wm) @Papaul yes, please delete and i'll try again this afternoon. ty! 
[15:19:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5004.eqsin.wmnet with OS bookworm [15:20:04] (03CR) 10Federico Ceratto: [C:03+2] db1253.yaml: prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1124379 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [15:20:35] (03PS4) 10Jgiannelos: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) [15:20:58] (03CR) 10Jgiannelos: pcs: Enable more wikis for native PCS pregeneration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [15:22:23] (03PS1) 10Vgutierrez: hiera: Restore lvs5004 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124459 (https://phabricator.wikimedia.org/T384477) [15:22:40] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:22:40] FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-worker2004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker2004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:22:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10601173 (10Jclark-ctr) @VRiley-WMF i did not see any dispatch created for this with Dell yet. 
Opened one 206453304 [15:23:00] (03PS9) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [15:23:00] (03PS19) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [15:23:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124459 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:24:39] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance elastic1108-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:24:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:23] (03CR) 10CI reject: [V:04-1] snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [15:26:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10601218 (10bking) [15:26:57] (03PS4) 10Fabfur: cache,haproxy: create tmpfile configuration for tls [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) [15:27:29] (03CR) 10Fabfur: cache,haproxy: create tmpfile configuration for tls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1124403 
(https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [15:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:29:31] (03PS5) 10Sergio Gimeno: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) [15:29:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance elastic1108-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:30:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74052 and previous config saved to /var/cache/conftool/dbconfig/20250304-153031-root.json [15:30:53] (03CR) 10JMeybohm: services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [15:31:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [15:32:01] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [15:32:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [15:33:14] (03PS1) 10Ahmon 
Dancy: deployment server: Don't pass -Dfull_image_build:True to scap stage-train [puppet] - 10https://gerrit.wikimedia.org/r/1124462 [15:33:23] (03CR) 10Michael Große: [C:03+1] [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [15:33:34] !log klausman@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-staging2002.codfw.wmnet with OS bookworm [15:33:56] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:34:02] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3494 MB (3% inode=98%): /tmp 3494 MB (3% inode=98%): /var/tmp 3494 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [15:35:32] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:35:37] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:36:30] (03PS4) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [15:37:14] Dreamy_Jazz: it's still deploying btw, but it seems likely it won't fail [15:37:20] it's at around 87% [15:37:25] Thanks for the update [15:37:40] (03CR) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [15:38:50] (03CR) 10Ssingh: [C:03+1] hiera: Restore lvs5004 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124459 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:39:23] (03CR) 10Vgutierrez: 
[C:03+2] hiera: Restore lvs5004 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1124459 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:39:43] ah no, it was at 87% of canaries [15:39:55] (03PS1) 10Sbisson: Enable CX unified dashboard on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387818) [15:39:55] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1031 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1124361 (owner: 10Muehlenhoff) [15:40:43] and just broke for mw-api-ext-next again [15:40:45] cc swfrench-wmf [15:40:49] (03PS2) 10Sbisson: Enable CX unified dashboard on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) [15:41:06] ... [15:41:22] confirmed you've picked up the older image [15:41:36] or rather, newer base image of the older state [15:41:48] sorry, language not fully working yet [15:41:55] !log repooling lvs5004 running liberica - T384477 [15:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:58] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [15:42:06] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [15:42:36] It's gonna roll back again [15:42:59] I'll try to do a helmfile apply only on those releases to see if I can tease out something [15:43:20] so, I have a theory [15:43:53] yesterday, I had to scale up these releases in advance of the next (heh) migration step [15:44:10] claime: this morning [15:44:13] however, I have concerns that a "costly pull" (in this case 2m or so) [15:44:28] doesn't interact well with the number of replicas [15:44:35] claime: after the first timeout, I did another scap sync-world and it was done in 4m [15:44:48] combined with the rollingUpdate [15:44:50] which is why I found that the failure was not important [15:45:01] effie: did you see 
a failure on your first scap? [15:45:26] swfrench-wmf: this morning yes, it was a timeout, then I ran a sync-world again, everything was in place [15:45:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74054 and previous config saved to /var/cache/conftool/dbconfig/20250304-154537-root.json [15:45:57] (03PS7) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [15:45:57] (03PS2) 10Ayounsi: Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [15:46:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [15:46:18] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10601316 (10HCoplin-WMF) @Bmueller could you please approve this request? :) [15:46:22] swfrench-wmf: then I had the jobrunner -next deployment not getting traffic, which was due to changeprop, and spent quite a while trying to crack that one [15:46:40] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1031.eqiad.wmnet [15:46:46] yeah, the connection pooling in changeprop makes for weirdness like that =/ [15:46:54] now I know :) [15:47:25] claime: I might send you a patch for 2 things, scaling back -next a bit and increasing the maxUnavailable [15:48:19] yeah, I'm waiting for the rollback to finish, and then I'm gonna try a manual helmfile apply on one of the -next just to clear something up [15:48:35] (namely is there some change I didn't see that trips it up) [15:48:49] sounds good, thank you! 
[15:49:38] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [15:50:17] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [15:50:49] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [15:51:08] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:51:16] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:51:20] (03CR) 10CI reject: [V:04-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [15:51:26] (03CR) 10Ahmon Dancy: "I need some assistance getting the envoy stuff set up properly. The current state is that PCC fails with the following:" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:51:31] ok so there were no pending changes [15:51:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [15:51:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:51:56] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [15:52:06] (03PS1) 10Ottomata: Revert "eventgate-logging-external - upgrade to node20" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124465 [15:52:30] (03CR) 10CI reject: [V:04-1] Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [15:52:31] (03PS2) 10Ottomata: Revert "eventgate-logging-external - upgrade to node20" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124465 (https://phabricator.wikimedia.org/T387850) [15:55:23] (03CR) 10Ottomata: [C:03+2] Revert 
"eventgate-logging-external - upgrade to node20" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124465 (https://phabricator.wikimedia.org/T387850) (owner: 10Ottomata) [15:57:47] (03PS1) 10Scott French: mw-(api-ext|web): shrink next to 80% and increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124468 (https://phabricator.wikimedia.org/T383845) [15:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:58:50] (03CR) 10Clément Goubert: [C:03+1] mw-(api-ext|web): shrink next to 80% and increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124468 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:59:58] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:00:02] (03CR) 10Clément Goubert: [C:03+2] mw-(api-ext|web): shrink next to 80% and increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124468 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:00:04] jelto, arnoldokoth, and mutante: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1600) [16:00:17] Dreamy_Jazz: still fighting with it [16:00:40] Ah. Good luck. About to join a meeting, so won't be around for a bit. 
[16:00:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:00:44] (03PS11) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [16:00:53] (03PS24) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [16:01:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10601383 (10bking) Note to self, there is a problem with our regex in `preseed.yaml` I noticed when working on T387782... [16:01:15] !log eventgate-logging-external: rolling back to pre node 20 due to bug likely caused by T382173. 
-- T387850 , T383814 [16:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:20] T382173: Enable Event Platform streams to opt out of collecting User-Agent data - https://phabricator.wikimedia.org/T382173 [16:01:21] T387850: NEL logs are missing geoip information - https://phabricator.wikimedia.org/T387850 [16:01:21] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [16:01:26] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [16:01:50] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [16:01:57] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [16:02:07] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [16:02:10] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [16:02:15] (03Merged) 10jenkins-bot: mw-(api-ext|web): shrink next to 80% and increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124468 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:02:34] !log brennen@deploy2002 Started deploy [phabricator/deployment@5d2302b]: test deploy phab2002 for T387873 [16:02:36] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [16:02:37] T387873: Deploy Phabricator/Phorge 2025-03-04 - https://phabricator.wikimedia.org/T387873 [16:02:57] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [16:03:03] !log brennen@deploy2002 Finished deploy [phabricator/deployment@5d2302b]: test deploy phab2002 for T387873 (duration: 00m 29s) [16:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 
0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:03:17] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [16:03:18] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-staging2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:03:33] !log brennen@deploy2002 Started deploy [phabricator/deployment@5d2302b]: deploy phab1004 for T387873 [16:04:24] !log brennen@deploy2002 Finished deploy [phabricator/deployment@5d2302b]: deploy phab1004 for T387873 (duration: 00m 51s) [16:04:35] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:40] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [16:05:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:05:22] !log cgoubert@deploy2002 Started scap sync-world: Shrink -next releases [16:06:40] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124471 [16:07:00] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bookworm [16:07:27] !log cgoubert@deploy2002 Finished scap sync-world: Shrink -next releases (duration: 02m 35s) [16:08:00] !log cgoubert@deploy2002 Started scap sync-world: Move image forward [16:08:12] (03PS8) 
10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [16:08:12] (03PS3) 10Ayounsi: Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [16:09:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [16:09:09] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124471 (owner: 10PipelineBot) [16:10:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:10:44] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124471 (owner: 10PipelineBot) [16:11:01] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [16:11:58] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10601492 (10Papaul) @Jhancock.wm remove nodes requests from puppetmaster1001. 
[16:12:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:28] (03PS3) 10Muehlenhoff: Bump versions of Java 11/17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 [16:14:01] (03CR) 10CI reject: [V:04-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [16:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3493 MB (3% inode=98%): /tmp 3493 MB (3% inode=98%): /var/tmp 3493 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:15:03] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [16:15:07] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [16:15:35] (03CR) 10CI reject: [V:04-1] Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [16:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:17] !log cgoubert@deploy2002 Finished scap sync-world: Move image forward (duration: 09m 16s) [16:17:24] Dreamy_Jazz: ok, your change is live [16:17:25] finally [16:17:31] Thanks! 
[16:18:18] I'll proceed with the rest of the changes from the window if itamarWMDE and Daimona are still around? [16:18:55] Hi claime! Yep, still around (I haven't caught up with what happened during the window) [16:19:09] Daimona: we broke deployments, sorry about that [16:19:18] So we are a bit late to the party [16:19:23] but i'll deploy your change [16:20:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [16:20:18] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump versions of Java 11/17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 (owner: 10Muehlenhoff) [16:20:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1031.eqiad.wmnet [16:20:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:20:39] Hehe, SNAFU? :) Anyway, no worries! I was going to reschedule it, but if it can be deployed now it's even better. 
[16:20:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: 10Daimona Eaytoy) [16:21:31] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsSeparateOngoingEvents by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: 10Daimona Eaytoy) [16:21:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:49] (03CR) 10DCausse: [C:03+2] wdqs: cleanup unused settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113745 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [16:22:01] !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1124172|Enable $wgCampaignEventsSeparateOngoingEvents by default (T386427)]] [16:22:04] T386427: Release section UI for Special:AllEvents - https://phabricator.wikimedia.org/T386427 [16:22:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [16:23:14] (03Merged) 10jenkins-bot: wdqs: cleanup unused settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113745 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [16:24:07] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:24:23] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:25:34] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [16:27:26] !log dcausse@deploy2002 helmfile [eqiad] START 
helmfile.d/services/rdf-streaming-updater: apply [16:27:50] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:28:46] (03CR) 10Kamila Součková: [C:03+1] validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [16:28:50] !log cgoubert@deploy2002 daimona, cgoubert: Backport for [[gerrit:1124172|Enable $wgCampaignEventsSeparateOngoingEvents by default (T386427)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:28:53] T386427: Release section UI for Special:AllEvents - https://phabricator.wikimedia.org/T386427 [16:28:54] Daimona: your change is live on mw-debug, can you test? [16:29:09] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:29:14] Yup [16:29:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [16:29:47] (03CR) 10Ssingh: wdqs: create query-legacy-full.wikidata.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [16:30:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [16:30:52] (03CR) 10DLynch: [C:03+1] Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders) [16:30:57] can we continue with deployments on k8s or there are still issues? 
I have a few patches to be deployed on mobileapps [16:31:29] nemo-yiannis: issues were only with mediawiki deployments, you should be ok to deploy mobileapps [16:31:39] got it, thanks [16:31:48] (if it doesn't cause an issue that we're deploying mediawiki repeatedly while you do it ofc) [16:31:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:31:58] (03CR) 10Ssingh: [C:03+1] wdqs: create query-legacy-full.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [16:32:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74056 and previous config saved to /var/cache/conftool/dbconfig/20250304-163207-root.json [16:32:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74057 and previous config saved to /var/cache/conftool/dbconfig/20250304-163212-root.json [16:32:34] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:33:40] (03PS1) 10Scott French: php8.1: Set display_startup_errors consistent with display_errors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124476 (https://phabricator.wikimedia.org/T377038) [16:34:32] claime: looks good to me, thank you! 
[16:34:38] !log cgoubert@deploy2002 daimona, cgoubert: Continuing with sync [16:34:48] cool, deploying to prod then [16:34:58] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:35:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:36:21] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@3eb0316]: Add new wikis. Enable prometheus metrics. [16:37:15] (03CR) 10Muehlenhoff: "See my inline comment, hieradata/role/common/deployment_server/kubernetes.yaml; try adding that and I think PCC will pass." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [16:37:31] (03PS21) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:37:43] (03CR) 10Effie Mouzeli: [C:03+1] php8.1: Set display_startup_errors consistent with display_errors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124476 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [16:38:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1031.eqiad.wmnet to cluster eqiad and group A [16:39:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1031.eqiad.wmnet to cluster eqiad and group A [16:41:27] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:42:55] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:42:56] (03PS22) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:43:29] !log cgoubert@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124172|Enable $wgCampaignEventsSeparateOngoingEvents by default (T386427)]] (duration: 21m 28s) [16:43:32] T386427: Release section UI for Special:AllEvents - https://phabricator.wikimedia.org/T386427 [16:43:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:43:42] itamarWMDE: are you still around? 
[16:44:35] ok I'm going to backport my change then [16:45:00] and if itamarWMDE don't manifest themself before I'm done I think we'll have to reschedule that patch [16:45:23] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:46:01] (03PS23) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:46:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [16:46:53] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:47:07] (03Merged) 10jenkins-bot: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [16:47:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74058 and previous config saved to /var/cache/conftool/dbconfig/20250304-164712-root.json [16:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74059 and previous config saved to /var/cache/conftool/dbconfig/20250304-164718-root.json [16:47:36] !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]] [16:47:38] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [16:48:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ml-staging2002.codfw.wmnet with OS bookworm [16:48:19] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10601796 (10thcipriani) [16:48:37] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10601799 (10thcipriani) Approved as group approver [16:49:20] (03PS1) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 [16:49:39] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bookworm [16:49:46] (03PS19) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [16:50:32] !log cgoubert@deploy2002 cgoubert, oblivian: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:50:59] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-staging2001 [16:51:10] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [16:51:28] !log cgoubert@deploy2002 cgoubert, oblivian: Continuing with sync [16:52:18] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [16:52:34] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:52:55] (03CR) 10Clément Goubert: [C:03+1] mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm) [16:53:44] (03CR) 10Ryan Kemper: wdqs: create query-legacy-full.wikidata.org (031 comment) [dns] - 
10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [16:55:09] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-staging2001 - klausman@cumin2002" [16:55:14] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-staging2001 - klausman@cumin2002" [16:55:14] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:15] !log klausman@cumin2002 START - Cookbook sre.dns.wipe-cache ml-staging2001.codfw.wmnet 201.0.192.10.in-addr.arpa 1.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:55:18] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-staging2001.codfw.wmnet 201.0.192.10.in-addr.arpa 1.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:55:19] !log klausman@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-staging2001 [16:55:30] !log klausman@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-staging2001 [16:55:30] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-staging2001 [16:56:13] (03CR) 10JMeybohm: [C:03+1] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [16:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:57:46] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@3eb0316]: Add new 
wikis. Enable prometheus metrics. (duration: 21m 25s) [16:57:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:58:01] that is klausman I am assuming [16:58:18] !log cgoubert@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]] (duration: 10m 42s) [16:58:20] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [16:58:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:59:00] (03CR) 10Ryan Kemper: [C:03+2] wdqs: create query-legacy-full.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [16:59:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [16:59:10] itamarWMDE: still not around? [16:59:21] !log ryankemper@dns1004 START - running authdns-update [17:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1700). [17:00:06] bd808: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:35] greetings [17:00:45] itamarWMDE: Sorry, I'll close the backport window, if you'd be so kind as to reschedule your backport for a future window [17:00:54] !log Closing UTC afternoon backport window [17:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] going ahead with that puppet patch if nothing else is in progress -- I don't think it'll conflict anyway [17:02:00] !log ryankemper@dns1004 END - running authdns-update [17:02:11] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:02:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74060 and previous config saved to /var/cache/conftool/dbconfig/20250304-170217-root.json [17:02:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1214 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74061 and previous config saved to /var/cache/conftool/dbconfig/20250304-170223-root.json [17:03:03] (03CR) 10RLazarus: [C:03+2] deployment-prep: Remove parsoid things from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1117997 (https://phabricator.wikimedia.org/T385849) (owner: 10BryanDavis) [17:03:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:03:47] (03PS1) 10Ahmon Dancy: envoy: Update examples [puppet] - 10https://gerrit.wikimedia.org/r/1124479 [17:04:58] puppet window complete, unless anyone has anything else [17:05:17] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.169s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:05:23] bd808: feel free to ping me if you need a followup [17:06:29] ... funny time for a coincidental parsoid alert but that's not real [17:08:01] (03PS1) 10Zoe: Set Flow to read-only on remaining phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) [17:08:15] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:08:28] (03CR) 10Ahmon Dancy: "Thanks! PCC didn't pass but it moved us forward to something weirder and seemingly unrelated:" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [17:09:51] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:10:17] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.614s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:11:04] (03PS1) 10Ryan Kemper: wdqs: fix query-legacy-full cert typo [puppet] - 10https://gerrit.wikimedia.org/r/1124481 (https://phabricator.wikimedia.org/T384422) [17:11:42] rzl: thanks. I apparently ignored all the pings here :/ [17:11:47] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:12:37] no worries! 
I'd've been more persistent if it looked like the patch needed real-time testing [17:12:43] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:14:22] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124481 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [17:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10601911 (10phaultfinder) [17:15:57] (03PS2) 10Ryan Kemper: wdqs: fix query-legacy-full cert typo [puppet] - 10https://gerrit.wikimedia.org/r/1124481 (https://phabricator.wikimedia.org/T384422) [17:16:19] (03PS1) 10DCausse: cirrus-streaming-updater: add explicit consumer/producer streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124483 (https://phabricator.wikimedia.org/T375821) [17:16:21] (03PS1) 10DCausse: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) [17:16:22] (03PS1) 10DCausse: cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) [17:16:24] (03PS1) 10DCausse: cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) [17:17:00] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fix query-legacy-full cert typo [puppet] - 10https://gerrit.wikimedia.org/r/1124481 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [17:17:00] (03CR) 10Ssingh: [C:03+1] wdqs: fix query-legacy-full cert typo [puppet] - 10https://gerrit.wikimedia.org/r/1124481 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [17:17:13] (03CR) 10KartikMistry: Enable CX unified dashboard on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 
(https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [17:17:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74062 and previous config saved to /var/cache/conftool/dbconfig/20250304-171722-root.json [17:18:36] (03CR) 10Ahmon Dancy: "One thing that automated use of `-Dfull_image_build:True` provides is _assurance_ that the latest base image is used at the start of each " [puppet] - 10https://gerrit.wikimedia.org/r/1124462 (owner: 10Ahmon Dancy) [17:20:38] rzl: any objections if I move forward with a potentially long-running scap deployment during the remainder of your window? [17:20:53] (03CR) 10DCausse: [C:04-2] "not ready yet" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [17:20:58] swfrench-wmf: no objections, fire away [17:21:13] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: add explicit consumer/producer streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124483 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [17:21:24] rzl: awesome, thank you! [17:22:32] (03CR) 10Scott French: "Thanks for the review!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124476 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [17:23:42] (03PS1) 10Jcrespo: dbbackups: Prepare backup1013 to take over eqiad backups of es* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1124487 (https://phabricator.wikimedia.org/T387892) [17:24:07] (03PS2) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) [17:24:41] !log klausman@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-staging2001.codfw.wmnet with OS bookworm [17:25:09] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bookworm [17:25:59] (03CR) 10Scott French: [V:03+2] "Verified locally by way of confirming shellbox issue is resolved in local testing." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124476 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [17:26:05] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: Set display_startup_errors consistent with display_errors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124476 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [17:31:47] !log built php8.1 production images with 'php8.1: Set display_startup_errors consistent with display_errors' - T377038 [17:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:50] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [17:32:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74064 and previous config saved to /var/cache/conftool/dbconfig/20250304-173228-root.json [17:32:47] (03PS1) 10Bvibber: Fix typo in wgTrackGlobalJsonLinksNamespaces [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) [17:33:36] !log swfrench@deploy2002 Started scap sync-world: Use latest php8.1 images - T377038 [17:34:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10602032 (10phaultfinder) [17:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3583 MB (3% inode=98%): /tmp 3583 MB (3% inode=98%): /var/tmp 3583 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [17:35:52] (03CR) 10Reedy: [C:04-1] "Needs fixing in CommonSettings-labs too (even if it'll go away with the config refactoring patches that are in progress)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:35:54] (03CR) 10Jforrester: [C:03+1] "<3 Sorry I never wrote the config static analyser for exactly these kinds of issues." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:36:21] (03CR) 10Jforrester: [C:03+1] "Config got the DB auto-schema updates so is that an issue? (But yes, cleanup good.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:37:27] (03PS2) 10Bvibber: Fix typo in wgTrackGlobalJsonLinksNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) [17:37:53] (03CR) 10Bvibber: "done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:38:13] (03CR) 10Bvibber: "(i hadn't saved the file. 
god damnnit it's not my day lol)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:38:14] (03CR) 10Jforrester: [C:03+1] Fix typo in wgTrackGlobalJsonLinksNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:38:22] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9e4721c7392b9befabb86c7bca857168069c2b91ff82bd452ca64542919689b2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [17:39:14] so i can schedule that for 1pm pacific time or i can leave y'all to rush it if you want [17:39:27] :D [17:39:41] i'm definitely doing my guard prefs more carefully next time, this was just sloppy. i apologize. [17:42:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387868#10602078 (10Pppery) →14Duplicate dup:03T382984 [17:42:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10602080 (10Pppery) [17:42:49] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [17:43:45] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: add explicit consumer/producer streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124483 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [17:45:29] bvibber: I am here! 
[17:45:31] (03Merged) 10jenkins-bot: cirrus-streaming-updater: add explicit consumer/producer streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124483 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [17:45:36] \o/ [17:45:38] whee [17:46:40] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [17:46:46] I'll deploy it [17:46:48] jouncebot: now [17:46:49] For the next 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1700) [17:46:50] FYI, there's a deployment ongoing attempting to confirm the fix for the issues that happened during the earlier backport window [17:46:52] thanks hashar <3 [17:46:59] please do not deploy [17:47:02] aho [17:47:04] hashar: ^ [17:47:29] just when I was going to ask in -sre hehe [17:47:31] perfect timing [17:47:35] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:48:05] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:48:20] apologies for the added complexity / out-of-schedule deployment - we wanted to get this test started early to make sure we avoid blocking the train window coming up [17:49:33] (03PS7) 10Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 [17:49:33] (03PS3) 10Sbisson: Enable CX unified dashboard on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) [17:49:36] (03CR) 10Vgutierrez: [C:03+1] cache,haproxy: create tmpfile configuration for tls [puppet] - 10https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [17:49:58] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:50:32] !log 
dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:51:02] (03CR) 10Sbisson: Enable CX unified dashboard on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [17:52:16] (03CR) 10Bartosz Dziewoński: "We have a tool that detects occurrences of config settings that no longer exist: https://wikitech.wikimedia.org/wiki/Technical_debt/Unused" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:52:44] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:53:21] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:53:34] PROBLEM - SSH on an-presto1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:54:32] swfrench-wmf: is there a task and where can we sync to watch the progress? :) [17:54:51] or it is base images being rebuilt isn't it? [17:55:13] it's the base image rebuild, yeah [17:55:25] (03CR) 10Hashar: [C:03+1] "This should be deployed at any time in order to unblock the train. 
Sync up in #wikimedia-operations though, SRE are currently doing an op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [17:55:51] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:56:00] i still can't believe i went to all the trouble of putting in a config flag ahead of time and i fucking copy-pasted it spelled wrong [17:56:22] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:56:40] (03PS1) 10Sergio Gimeno: analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124493 (https://phabricator.wikimedia.org/T387286) [17:56:45] *smdh* good point about the tooling for detecting unknown/obsolete config vars :D could improve that yet hehe [17:56:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124493 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [17:57:16] (03PS1) 10Sergio Gimeno: analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124494 (https://phabricator.wikimedia.org/T387286) [17:57:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124494 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [17:58:05] !log swfrench@deploy2002 Finished scap sync-world: Use latest php8.1 images - T377038 (duration: 
24m 53s) [17:58:07] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [17:58:57] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:59:20] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10602137 (10Bmueller) Approved! Thanks for the ping @HCoplin-WMF :-) [17:59:23] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:59:28] RECOVERY - SSH on an-presto1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:00:05] swfrench-wmf: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1800). [18:00:19] (03CR) 10Jcrespo: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1124487/5025/" [puppet] - 10https://gerrit.wikimedia.org/r/1124487 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [18:03:29] bvibber: so once the feature flag is fixed we are set apparently. I have a question though: when I promoted group0, I only had a few thousand queries, all from jobs. So is that code running on a timer, or only triggered in some cases? [18:05:32] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bookworm [18:07:11] hashar: I have one more change to get out during this infra window that does not require a scap deployment (helmfile only). I can hold on that if you need to get this fix out to unblock the train.
[18:07:42] swfrench-wmf: na na go ahead and finish what you are doing :) [18:08:02] I'll do bvibber's config change after [18:08:16] :thumbs-up: [18:08:34] ack, whatever works for you folks :) [18:08:45] and there is the train window in an hour for the rest ( https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1900 ) [18:09:21] just make sure what you did works, and if something is off we will just further delay the train by whatever time it takes to get things fixed [18:09:42] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [18:10:04] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-staging2003 [18:10:05] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-staging2003 [18:10:07] as long as we are all in line and prod does not explode, I am all good :-] [18:10:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) (owner: 10Zoe) [18:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:12:11] (03CR) 10Scott French: "Thanks for the review!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1124210 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:12:13] (03CR) 10Scott French: [C:03+2] mw-api-int: serve 25% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124210 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:13:34] (03Merged) 10jenkins-bot: mw-api-int: serve 25% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124210 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:15:21] claime: apologies, I didn't notice I scheduled the patch over a team meeting I was taking part in, I will reschedule it again (and put a cal entry this time 🤦‍♂️) [18:15:25] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:15:46] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:16:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:16:32] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:18:15] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:18:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:19:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:19:27] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:20:27] !log serving 25% of mw-api-int traffic on PHP 8.1 - T383845 [18:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:30] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:23:14] alright, I'm done with work planned for the infra window [18:23:30] moaaar php8.1! 
[18:23:32] hashar: bvibber: feel free to move forward with your backport if you'd like [18:23:35] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage [18:23:36] :) [18:24:04] cool I am going for bvibber patch [18:26:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2003.codfw.wmnet with reason: host reimage [18:27:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [18:27:54] (03PS4) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) [18:28:20] (03Merged) 10jenkins-bot: Fix typo in wgTrackGlobalJsonLinksNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124488 (https://phabricator.wikimedia.org/T387843) (owner: 10Bvibber) [18:28:32] \o/ [18:28:53] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1124488|Fix typo in wgTrackGlobalJsonLinksNamespaces (T387843 T385917)]] [18:28:58] T387843: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'gjlw_namespace' in 'where clause' - https://phabricator.wikimedia.org/T387843 [18:28:58] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [18:30:43] (03PS2) 10Reedy: CommonSettings-labs.php: Fix $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124496 [18:36:09] !log hashar@deploy2002 hashar, bvibber: Backport for [[gerrit:1124488|Fix typo in wgTrackGlobalJsonLinksNamespaces (T387843 T385917)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:36:13] T387843: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'gjlw_namespace' in 'where 
clause' - https://phabricator.wikimedia.org/T387843 [18:36:13] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [18:36:53] !log hashar@deploy2002 hashar, bvibber: Continuing with sync [18:38:43] (03PS2) 10Hashar: deployment server: Don't pass -Dfull_image_build:True to scap stage-train [puppet] - 10https://gerrit.wikimedia.org/r/1124462 (https://phabricator.wikimedia.org/T387823) (owner: 10Ahmon Dancy) [18:41:39] (03CR) 10Hashar: [C:03+1] "The context is T387823 and I have amended the commit message to attach this change to it. In short:" [puppet] - 10https://gerrit.wikimedia.org/r/1124462 (https://phabricator.wikimedia.org/T387823) (owner: 10Ahmon Dancy) [18:41:59] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2003.codfw.wmnet with OS bookworm [18:43:45] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124488|Fix typo in wgTrackGlobalJsonLinksNamespaces (T387843 T385917)]] (duration: 14m 51s) [18:43:49] T387843: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'gjlw_namespace' in 'where clause' - https://phabricator.wikimedia.org/T387843 [18:43:49] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [18:45:09] swfrench-wmf: bvibber: patch deployed! I will run the train in 15 minutes :) [18:45:26] woot [18:45:30] hopefully no more explodey [18:46:24] awesome :) [18:51:06] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [18:53:02] (03PS1) 10Dbrant: Remove unused config parameters from ReadingLists extension.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 [18:54:09] (03CR) 10Scott French: shellbox-media: serve 1/8 of requests on 8.1 with more logging (2) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [18:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10602362 (10phaultfinder) [18:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3484 MB (3% inode=98%): /tmp 3484 MB (3% inode=98%): /var/tmp 3484 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [18:59:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 (owner: 10Dbrant) [19:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T1900) [19:01:15] (03PS1) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [19:01:44] ok so [19:01:56] tonight is Charlotte deploying the train. She is my daughter and is 10 years old [19:02:02] under my supervision of course! 
:b [19:02:49] 🍿 [19:03:06] I have been saying for months that running the train is a no-brainer nowadays :) [19:03:18] <3 [19:05:22] (03PS2) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [19:05:40] dancy: my kid is in awe of the ascii art train and the rail track [19:05:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [19:05:57] haha. I'm glad. [19:06:40] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124503 (https://phabricator.wikimedia.org/T386214) [19:06:42] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124503 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [19:06:52] Charlotte quote: "your job is as simple as that?" [19:07:08] me: yeah we are working very hard to make THAT simple :b [19:07:34] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124503 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [19:11:26] unfortunately my train conductor has to go brush her teeth so I am taking over :) [19:11:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:12:04] (03PS1) 10TChin: mw-content-history-reconcile-enrich: Bump taskmanager memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124504 (https://phabricator.wikimedia.org/T387906) [19:13:38] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [19:15:01] (03CR) 10Xcollazo: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124504 (https://phabricator.wikimedia.org/T387906) (owner: 10TChin) [19:20:33] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.19 refs T386214 [19:20:36] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [19:21:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:22:58] CampaignEvents has some undefined array elements [19:25:31] ooh spammy [19:27:10] (03CR) 10Cathal Mooney: Expose _gql_execute to wmf-netbox + fetch GQL queries from files (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [19:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:30:26] yeah I filed the CampaignEvents one at https://phabricator.wikimedia.org/T387914 [19:30:39] it seems to be merely backend log noise with no user impact as I can tell [19:30:51] I will poke the devs [19:30:59] thx [19:32:23] Daimona: it is harmless isn't it? [19:32:34] or should we roll back? [19:32:46] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@10615c9]: Deploy latet DAGs for analytics Airflow instance. T387906. [19:32:50] T387906: Investigate why the mw-content-history-reconcile-enrich Flink job failed. - https://phabricator.wikimedia.org/T387906 [19:33:20] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@10615c9]: Deploy latet DAGs for analytics Airflow instance. T387906. (duration: 00m 34s) [19:35:34] dancy: Daimona answer on the task. 
I guess they are having dinner and said they will come to it in an hour or so [ https://phabricator.wikimedia.org/T387914#10602619 ] [19:35:40] I think it is fine to leave it there for now [19:35:56] OK. I'll keep an eye on it. [19:36:05] besides that it looks quiet [19:36:44] and the issue that caused me to roll back earlier today did not happen (bvibber fixed the typo in the feature flag that was supposed to disable the feature that relies on a db schema change which is not fully deployed) [19:36:50] damn that is a long sentence [19:36:58] so I think it is pretty much set [19:37:35] also scap clean now deletes the old branches without any log spam, which concludes a 3+ month journey \o/ [19:37:55] Congrats! [19:38:08] I hand off the train to you :-] [19:48:14] (03CR) 10Kimberly Sarabia: Deploy Search AB test to everywhere but English wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [19:48:57] (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Bump taskmanager memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124504 (https://phabricator.wikimedia.org/T387906) (owner: 10TChin) [19:50:10] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Bump taskmanager memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124504 (https://phabricator.wikimedia.org/T387906) (owner: 10TChin) [19:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:54:56] (03PS1) 10Jforrester: wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509
[19:56:08] (03CR) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [19:57:56] (03PS1) 10Bernard Wang: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 [19:58:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [19:58:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [19:58:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [20:01:10] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:03:09] (03PS5) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) [20:03:36] (03CR) 10Kimberly Sarabia: [C:03+1] Deploy Search AB test to everywhere but English wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [20:04:55] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns names for test servers nokia lab - cmooney@cumin1002" [20:05:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns names for test servers nokia lab - cmooney@cumin1002" [20:05:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:08:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123417 (owner: 10Jforrester) [20:08:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123418 (owner: 10Jforrester) [20:08:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123419 (owner: 10Jforrester) [20:09:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123420 (owner: 10Jforrester) [20:09:19] (03PS3) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [20:10:58] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10602741 (10Dzahn) >>! In T387597#10599523, @fgiunchedi wrote: > The bot will update the existing open task if more than one host is alerting, under `firing alerts` all hosts (in a given site) will be l... 
[20:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3563 MB (3% inode=98%): /tmp 3563 MB (3% inode=98%): /var/tmp 3563 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [20:18:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:20:16] (03PS4) 10Herron: KubernetesRsyslogDown: bump threshold to 15m [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) [20:29:20] (03PS4) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [20:29:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:33:24] (03PS2) 10Reedy: CommonSettings.php: Remove $wgSecurePollGPGCommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124514 [20:39:36] (03PS2) 10Reedy: Remove $wgExternalLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124515 [20:39:52] (03CR) 10CI reject: [V:04-1] Remove $wgExternalLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124515 (owner: 10Reedy) [20:40:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm [20:40:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10602866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm [20:40:31] (03PS5) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 
(https://phabricator.wikimedia.org/T387904) [20:40:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:41:12] (03PS3) 10Reedy: Remove $wgTemplateLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124517 [20:42:06] (03PS3) 10Reedy: Remove $wgExternalLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124515 [20:44:14] (03PS5) 10Reedy: Remove $wgTemplateLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124517 [20:47:30] (03PS2) 10Reedy: CommonSettings.php: Rename $wgTranslateServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124519 [20:47:42] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10602894 (10dduvall) Is there something else filling up `/srv`? It has filled back up and `docker system df` hasn't changed much. ` dduvall@deploy2002:~$ df -... 
[20:50:42] (03PS6) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [20:50:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:52:26] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:53:26] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:56:24] (03PS7) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [20:56:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:56:34] 06SRE, 07Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10602962 (10Aklapper) [20:57:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:53] 06SRE, 06DBA, 07Datacenter-Switchover: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default) - https://phabricator.wikimedia.org/T207385#10602967 (10Aklapper) [20:59:04] (03PS2) 10Reedy: Remove $wgReadingListsCluster/$wgReadingListsDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124520 
[20:59:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T2100). [21:00:05] Krinkle, Jdlrobson, tgr, MichaelG_WMF, sergi0, and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] o/ [21:00:11] o/ [21:00:12] Ouch. [21:00:30] Hello [21:00:33] This deploy is rather over-full. Three different backports plus the SUL roll-out again? [21:00:37] Who is taking it? [21:01:00] I can, if needed. [21:01:31] o/ [21:01:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123713 (https://phabricator.wikimedia.org/T358910) (owner: 10Jdlrobson) [21:01:43] o/ [21:01:49] Jdlrobson: You're up first. [21:02:00] 👍 [21:02:07] Hmm, actually, can I do Krinkle's config change first? [21:02:18] Krinkle: How testable is your patch? [21:02:27] trivial, just one curl request. [21:02:33] Ack, let's do that then. [21:02:35] I suggest we combine a few given it's 12 patches in an hour. [21:02:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [21:02:51] Yeah, my four will go in one go. [21:03:18] (And are totally delayable if needed, just clean-up.) 
[21:03:25] (03Merged) 10jenkins-bot: docroot: Enable Chrome credential sharing on all open SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [21:03:31] I can go to end of queue and self-deploy, want to do a bit of extra testing [21:03:44] tgr_: Ack, will delay yours then. [21:03:47] Jdlrobson: are you using the web team window after this one? [21:03:56] tgr_: i dont think so let me check [21:04:07] toyofuku jan_drewniak are we using it today? [21:04:09] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1123810|docroot: Enable Chrome credential sharing on all open SUL wikis (T385520)]] [21:04:12] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [21:04:21] (03CR) 10Dzahn: "There is a deeper rabbit hole problem here and this is what happens:" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [21:04:36] Michael's and my changes can go together (except for the config change that needs to go last) [21:04:44] sergi0: Ack. [21:05:41] tgr_: yes, we have one config changes we want to deploy during the web team backport window today. [21:05:46] So I'll do Krinkle's config, then Jdlrobson's, and then Michael and sergi0's backports, then sergi0's config plus mine. 
[21:05:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:06:23] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:06:38] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:11] !log jforrester@deploy2002 jforrester, krinkle: Backport for [[gerrit:1123810|docroot: Enable Chrome credential sharing on all open SUL wikis (T385520)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:14] Krinkle: Please check. [21:07:42] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:48] https://en.wikipedia.org/.well-known/assetlinks.json and https://auth.wikimedia.org/.well-known/assetlinks.json LGTM on WikimediaDebug: kus-mwdebug [21:07:54] Go ahead [21:07:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:08:04] !log jforrester@deploy2002 jforrester, krinkle: Continuing with sync [21:08:32] (03CR) 10Dbrant: "haha I already had a patch going: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124500" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124520 (owner: 10Reedy) [21:08:39] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:12:18] (03PS8) 10Bking: cloudelastic: begin transition to opensearch 
[puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [21:12:35] (03Merged) 10jenkins-bot: Revert "styles: Remove transparent PNG fallback for `.vector-icon`" [skins/Vector] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123713 (https://phabricator.wikimedia.org/T358910) (owner: 10Jdlrobson) [21:13:08] * James_F drums his fingers. [21:14:42] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123810|docroot: Enable Chrome credential sharing on all open SUL wikis (T385520)]] (duration: 10m 33s) [21:14:45] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [21:14:45] (03PS2) 10Reedy: CommonSettings.php: Remove $wgCodeEditorEnableCore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124524 [21:14:49] Finally. [21:15:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [21:15:06] I'll run the merge for the next batch. 
[21:15:13] (03CR) 10Jforrester: [C:03+2] fix(surfacing): don't show highlights on protected pages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124449 (owner: 10Michael Große) [21:15:13] (03CR) 10Jforrester: [C:03+2] fix(surfacing): don't show highlights on protected pages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124451 (owner: 10Michael Große) [21:15:14] (03CR) 10Jforrester: [C:03+2] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124493 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:15:16] (03CR) 10Jforrester: [C:03+2] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124494 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:15:21] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1123713|Revert "styles: Remove transparent PNG fallback for `.vector-icon`" (T358910 T387351)]] [21:15:25] T358910: [M] Open image full screen for Image Recs - https://phabricator.wikimedia.org/T358910 [21:15:25] T387351: [regression] ToC icons appears as black squares while assets are loading - https://phabricator.wikimedia.org/T387351 [21:16:13] (03PS3) 10Reedy: CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 [21:18:23] !log jforrester@deploy2002 jforrester, jdlrobson: Backport for [[gerrit:1123713|Revert "styles: Remove transparent PNG fallback for `.vector-icon`" (T358910 T387351)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:25] Jdlrobson: Please check. [21:18:28] on it [21:18:45] James_F: LGTM please sync! 
[21:18:48] !log jforrester@deploy2002 jforrester, jdlrobson: Continuing with sync [21:18:50] Ta. [21:20:43] Net split during deploy, how helpful. [21:21:07] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10603063 (10Dzahn) It's still /srv/docker/ (/srv/docker/overlay2) . Back to using about 164GB. [21:22:09] re-routing in Libera [21:22:09] ? [21:22:11] And North America is back again 👋 [21:22:27] Welcome back, Europe. [21:23:42] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2045 [21:23:47] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:23:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm [21:24:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu... 
[21:24:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2045 [21:25:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:34] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123713|Revert "styles: Remove transparent PNG fallback for `.vector-icon`" (T358910 T387351)]] (duration: 10m 13s) [21:25:38] T358910: [M] Open image full screen for Image Recs - https://phabricator.wikimedia.org/T358910 [21:25:38] T387351: [regression] ToC icons appears as black squares while assets are loading - https://phabricator.wikimedia.org/T387351 [21:25:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124449 (owner: 10Michael Große) [21:25:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124451 (owner: 10Michael Große) [21:25:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124493 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:25:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124494 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:26:19] (03Merged) 10jenkins-bot: fix(surfacing): don't show highlights on protected pages [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124449 (owner: 10Michael Große) [21:26:20] (03Merged) 10jenkins-bot: fix(surfacing): don't show highlights on protected pages 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124451 (owner: 10Michael Große) [21:27:30] (03Merged) 10jenkins-bot: analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124493 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:27:31] (03Merged) 10jenkins-bot: analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data [extensions/GrowthExperiments] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124494 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [21:28:04] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to ev [21:28:04] ent data (T387286)]] [21:28:08] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [21:28:13] thanks James_F [21:28:21] Happy to help. [21:30:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:31:34] Hmm, scap stuck? No updates for 90s. [21:32:07] Finally. 
[21:32:53] !log jforrester@deploy2002 sgimeno, jforrester, migr: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to [21:32:53] event data (T387286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:54] MichaelG_WMF, sergi0: Please check. :-) [21:33:11] testing [21:33:16] I was optimistic and already tested my change (on testwiki) and it works as expected 👍 [21:33:26] Excellent. [21:34:50] The event data also lgtm [21:34:54] Ace. [21:34:55] !log jforrester@deploy2002 sgimeno, jforrester, migr: Continuing with sync [21:35:07] With these, the config change can go next? [21:35:22] correct [21:35:27] Brill. [21:35:57] And then I'll hand over to tgr_.
[21:39:43] (03CR) 10Jforrester: [C:03+2] [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [21:40:24] (03Merged) 10jenkins-bot: [Growth] Enable surfacing structured tasks A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [21:40:50] (03CR) 10Jforrester: [C:03+2] IS: Stop setting wgParserConf, unused since MW 1.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123417 (owner: 10Jforrester) [21:40:53] (03CR) 10Jforrester: [C:03+2] CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123418 (owner: 10Jforrester) [21:40:55] (03CR) 10Jforrester: [C:03+2] CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123419 (owner: 10Jforrester) [21:40:57] (03CR) 10Jforrester: [C:03+2] CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123420 (owner: 10Jforrester) [21:41:32] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to e [21:41:32] vent data (T387286)]] (duration: 13m 27s) [21:41:33] (03Merged) 10jenkins-bot: IS: Stop setting wgParserConf, unused since MW 1.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123417 (owner: 10Jforrester) [21:41:35] T387286: Track variant assignment on account creation - https://phabricator.wikimedia.org/T387286 [21:41:37] (03Merged) 
10jenkins-bot: CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123418 (owner: 10Jforrester) [21:41:40] (03Merged) 10jenkins-bot: CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123419 (owner: 10Jforrester) [21:41:42] (03Merged) 10jenkins-bot: CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123420 (owner: 10Jforrester) [21:41:52] (03PS2) 10Scott French: shellbox-media: serve 1/8 of requests on 8.1 with more logging (2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [21:42:21] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1123417|IS: Stop setting wgParserConf, unused since MW 1.36]], [[gerrit:1123418|CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39]], [[gerrit:1123419|CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39]], [[gerrit:1123420|CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41]], [[gerrit:1120505|[Growth] Enabl [21:42:21] e surfacing structured tasks A/B test (T385343)]] [21:42:23] T385343: Surfacing "Add a link" Structured Tasks: Experiment Release (FY24/25 WE1.2.9) - https://phabricator.wikimedia.org/T385343 [21:43:52] (03CR) 10Scott French: "Thanks again, Effie! As discussed earlier, I'll try to move this forward during my afternoon." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [21:44:11] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. 
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10603177 (10Ahoelzl) [21:45:18] !log jforrester@deploy2002 jforrester, sgimeno: Backport for [[gerrit:1123417|IS: Stop setting wgParserConf, unused since MW 1.36]], [[gerrit:1123418|CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39]], [[gerrit:1123419|CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39]], [[gerrit:1123420|CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41]], [[gerrit:1120505|[Growth] Enable su [21:45:18] rfacing structured tasks A/B test (T385343)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:45:21] sergi0: Please check this last config thing. [21:45:29] checking [21:46:03] (03PS9) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [21:46:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [21:47:08] James_F: lgtm, from my side [21:47:10] !log jforrester@deploy2002 jforrester, sgimeno: Continuing with sync [21:47:27] Excellent, we might even give tgr_ a whole 8 minutes to test. :-( [21:48:16] thx :) [21:48:24] Sorry. [21:49:28] no worries, I can wait for the web team deploy and use the rest of that window [21:52:58] Ack. 
[21:53:53] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123417|IS: Stop setting wgParserConf, unused since MW 1.36]], [[gerrit:1123418|CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39]], [[gerrit:1123419|CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39]], [[gerrit:1123420|CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41]], [[gerrit:1120505|[Growth] Enab [21:53:53] le surfacing structured tasks A/B test (T385343)]] (duration: 11m 31s) [21:53:55] T385343: Surfacing "Add a link" Structured Tasks: Experiment Release (FY24/25 WE1.2.9) - https://phabricator.wikimedia.org/T385343 [21:54:29] OK, done. [21:56:24] Thank you James_F! [21:57:46] Happy to help! [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250304T2200) [22:02:20] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:02:33] jan_drewniak: can you ping me when you are done? I have one more patch to deploy [22:02:49] tgr_: hey ok thanks! 
it shouldn't be long, just a config change [22:02:54] thx [22:03:20] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:07:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [22:07:49] (03Merged) 10jenkins-bot: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [22:08:21] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1122684|Deploy Search AB test to everywhere but English wiki (T386849)]] [22:08:24] T386849: Deploy empty search AB test to all wikis - https://phabricator.wikimedia.org/T386849 [22:11:20] !log jdrewniak@deploy2002 jdrewniak, bwang: Backport for [[gerrit:1122684|Deploy Search AB test to everywhere but English wiki (T386849)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:11:29] (03PS1) 10Cwhite: profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) [22:12:04] (03PS2) 10Cwhite: profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) [22:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3559 MB (3% inode=98%): /tmp 3559 MB (3% inode=98%): /var/tmp 3559 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:15:12] !log jdrewniak@deploy2002 jdrewniak, bwang: Continuing with sync [22:19:33] !log clearing user_real_name in group0 
wikis (T387212) [22:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:37] T387212: Blank user_real_name in all Wikimedia wikis - https://phabricator.wikimedia.org/T387212 [22:21:56] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122684|Deploy Search AB test to everywhere but English wiki (T386849)]] (duration: 13m 34s) [22:21:58] T386849: Deploy empty search AB test to all wikis - https://phabricator.wikimedia.org/T386849 [22:22:24] tgr_: ok I'm all done. [22:23:00] thanks! [22:25:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123807 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [22:26:10] (03Merged) 10jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 (attempt 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123807 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [22:26:38] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123807|CentralAuth: Enable SUL3 signup on group 0 (attempt 3) (T384007)]] [22:26:41] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007 [22:29:36] !log tgr@deploy2002 tgr: Backport for [[gerrit:1123807|CentralAuth: Enable SUL3 signup on group 0 (attempt 3) (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:32:47] (03PS2) 10Bking: wdqs-categories: remove extraneous wgCirrusSearchCategoryEndpoint value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) [22:33:10] (03CR) 10Ryan Kemper: [C:03+1] wdqs-categories: remove extraneous wgCirrusSearchCategoryEndpoint value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: 10Bking) [22:33:21] (03CR) 10Cwhite: [C:03+1] "LGTM" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [22:34:53] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:35:16] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:38:41] (03PS3) 10Ryan Kemper: wdqs-categories: remove extraneous wgCirrusSearchCategoryEndpoint value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: 10Bking) [22:38:41] (03PS1) 10Ryan Kemper: wdqs categories: switch to internal-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) [22:41:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10603349 (10Jhancock.wm) [22:41:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:50:30] !log tgr@deploy2002 Sync cancelled. 
[22:51:45] (03PS1) 10TrainBranchBot: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124537 [22:51:45] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as Ibc6fa5d19ef744f420bfb01ebee43990f144ce6d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123807 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [22:55:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124537 (owner: 10TrainBranchBot) [22:56:18] (03Merged) 10jenkins-bot: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124537 (owner: 10TrainBranchBot) [22:56:48] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124537|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 3)"]] [22:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10603419 (10phaultfinder) [22:59:51] !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1124537|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 3)"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:01:45] !log tgr@deploy2002 trainbranchbot, tgr: Continuing with sync [23:06:43] Looks like beta is broken: "Argument #1 ($title) must be of type Title, MediaWiki\Title\Title given, called in /srv/mediawiki/php-master/includes/HookContainer/HookContainer.php on line 155" [23:07:12] certainly caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1120610/ [23:08:24] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124537|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 3)"]] (duration: 11m 36s) [23:11:01] !log UTC very late deploys done [23:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:48] PROBLEM - MegaRAID on 
an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10603481 (10phaultfinder) [23:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:29:34] (03CR) 10Cwhite: "I *think* this is the right instance. LMK if we need a different instance to handle these scrapes." [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [23:30:47] (03PS1) 10Ebernhardson: cirrus: Provide an empty log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124543 [23:30:53] (03CR) 10Cwhite: "> I *think* this is the right instance. LMK if we need a different instance to handle these scrapes." 
[puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [23:31:35] (03PS2) 10Ebernhardson: cirrus: Provide an empty log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124543 [23:33:44] (03CR) 10Ebernhardson: [C:03+2] cirrus: Provide an empty log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124543 (owner: 10Ebernhardson) [23:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3227 MB (3% inode=98%): /tmp 3227 MB (3% inode=98%): /var/tmp 3227 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [23:35:16] (03Merged) 10jenkins-bot: cirrus: Provide an empty log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124543 (owner: 10Ebernhardson) [23:40:25] (03PS5) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [23:40:33] jouncebot: nowandnext [23:40:33] No deployments scheduled for the next 7 hour(s) and 19 minute(s) [23:40:33] In 7 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0700) [23:40:38] (03PS6) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [23:44:17] (03PS1) 10Ebernhardson: cirrus: Provide a specific log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124545 [23:46:09] (03CR) 10Scott French: [C:03+2] "Doing so now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [23:47:30] (03Merged) 10jenkins-bot: shellbox-media: serve 1/8 of requests on 8.1 with more logging (2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124388 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [23:47:54] (03CR) 10Ebernhardson: [C:03+2] cirrus: Provide a specific log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124545 (owner: 10Ebernhardson) [23:49:23] (03Merged) 10jenkins-bot: cirrus: Provide a specific log4j-overrides.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124545 (owner: 10Ebernhardson) [23:49:23] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [23:49:53] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [23:51:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [23:51:36] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [23:53:08] !log started shellbox-media PHP 8.1 pilot with increased logging and display_startup_errors fix - T377038 [23:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:11] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [23:53:59] (03PS1) 10Ebernhardson: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 [23:54:07] (03CR) 10CI reject: [V:04-1] flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson) [23:54:21] (03PS2) 10Ebernhardson: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 [23:57:39] (03PS3) 10Ebernhardson: flink-app chart: Support per-chart 
logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546