[00:06:20] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10886215 (Dwisehaupt) Thanks. I'm not sure what's going on. Looking at the bios, it shows as connected and has a link speed of 10G. However, when I...
[00:08:34] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153742
[00:08:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153742 (owner: TrainBranchBot)
[00:10:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 648.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:29:43] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1153742 (owner: TrainBranchBot)
[00:30:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:31:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:46:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[01:14:11] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10886244 (Ladsgroup) Okay, I made deeper investigation. I uploaded a rando...
[01:22:21] (PS1) Scott French: scap: block interactive maintenance scripts on mwmaint [puppet] - https://gerrit.wikimedia.org/r/1152820
[01:30:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:31:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:57] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:18:17] (PS1) Jdrewniak: Revert "Deploy survey to en at twenty percent" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153750
[02:20:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (doc2003), Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:21:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June
05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153750 (owner: Jdrewniak)
[02:35:31] RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2198) taken on 2025-06-05 01:39:23 (1089 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:39:45] RECOVERY - snapshot of s8 in eqiad on backupmon1001 is OK: Last snapshot for s8 at eqiad (db1171) taken on 2025-06-05 01:52:09 (1217 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:08:09] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10886385 (aaron) The idea of preloadFileStat() was to allow concurrent HEA...
[03:13:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:16:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:25:04] (PS1) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[03:52:13] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[03:55:29] RECOVERY -
Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:58:29] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:23:45] (PS2) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[04:37:15] (PS3) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[04:43:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:46:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:49:42] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10886436 (phaultfinder)
[04:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines.
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[04:57:32] (PS2) Giuseppe Lavagetto: analytics::packages::common: include ua-parser regexes.yaml [puppet] - https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794)
[04:58:41] (PS7) Giuseppe Lavagetto: cache::haproxy: remove unused variables from configuration [puppet] - https://gerrit.wikimedia.org/r/1152083
[04:59:44] (CR) CI reject: [V:-1] analytics::packages::common: include ua-parser regexes.yaml [puppet] - https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) (owner: Giuseppe Lavagetto)
[05:03:31] (CR) Giuseppe Lavagetto: [C:+2] cache::haproxy: remove unused variables from configuration [puppet] - https://gerrit.wikimedia.org/r/1152083 (owner: Giuseppe Lavagetto)
[05:05:06] (CR) Giuseppe Lavagetto: [C:+2] cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - https://gerrit.wikimedia.org/r/1152288 (owner: Giuseppe Lavagetto)
[05:05:12] (PS6) Giuseppe Lavagetto: cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - https://gerrit.wikimedia.org/r/1152288
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:18:11] (PS7) Marostegui: ms1: Move hosts to objectstash role [puppet] - https://gerrit.wikimedia.org/r/1152673
[05:18:11] (PS1) Marostegui: db2180: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153777 (https://phabricator.wikimedia.org/T395989)
[05:20:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2180 T395989', diff saved
to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
[05:20:07] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989
[05:20:31] (CR) CI reject: [V:-1] ms1: Move hosts to objectstash role [puppet] - https://gerrit.wikimedia.org/r/1152673 (owner: Marostegui)
[05:20:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
[05:21:45] (PS1) Marostegui: db2180: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153778 (https://phabricator.wikimedia.org/T395989)
[05:22:18] (Abandoned) Marostegui: db2180: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153777 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui)
[05:22:25] (CR) Marostegui: [C:+2] db2180: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1153778 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui)
[05:23:50] (CR) Giuseppe Lavagetto: [C:+2] cache::haproxy: remove post_acl_actions and sticktable variables [puppet] - https://gerrit.wikimedia.org/r/1152288 (owner: Giuseppe Lavagetto)
[05:24:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc2 T395983', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
[05:24:47] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[05:25:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
[05:25:14] !log Change datadir on pc2 dbmaint eqiad codfw T395983
[05:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:07] !log marostegui@cumin1002 dbctl commit
(dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
[05:29:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc2 T395983', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
[05:33:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc3 T395983', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json
[05:33:21] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[05:33:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance
[05:36:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc3 T395983', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json
[05:39:05] (CR) Effie Mouzeli: [C:+1] "Thank you, this is great!!!"
[puppet] - https://gerrit.wikimedia.org/r/1152820 (owner: Scott French)
[05:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json
[05:43:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:46:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:47:50] !log Change datadir on pc3 dbmaint eqiad codfw T395983
[05:47:52] !log Change datadir on pc4 dbmaint eqiad codfw T395983
[05:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:54] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[05:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc4 T395983', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json
[05:48:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance
[05:50:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc4 T395983', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json
[05:51:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool
pc5 T395983', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json
[05:51:31] !log Change datadir on pc5 dbmaint eqiad codfw T395983
[05:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance
[05:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc5 T395983', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json
[05:53:53] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[05:54:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc6 T395983', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json
[05:54:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
[05:55:17] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - free space: / 2119 MB (3% inode=95%): /tmp 2119 MB (3% inode=95%): /var/tmp 2119 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops
[05:56:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json
[05:56:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc6 T395983', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to
/var/cache/conftool/dbconfig/20250605-055655-marostegui.json
[05:57:02] !log Change datadir on pc6 dbmaint eqiad codfw T395983
[05:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T0600)
[06:00:05] marostegui, Amir1, and federico3: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T0600). nyaa~
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:58] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:58] <_joe_> I am 10 minutes away from my pc
[06:04:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:07:58] RESOLVED: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on
lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[06:11:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json
[06:11:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
[06:12:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc7 T395983', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json
[06:12:03] T395983: Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[06:15:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc7 T395983', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json
[06:15:07] !log Change datadir on pc7 dbmaint eqiad codfw T395983
[06:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance
[06:16:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc8 T395983', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json
[06:19:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc8 T395983', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json
[06:19:33] T395983: Migrate /srv/sqldata-cache
directory in parsercache to /srv/sqldata - https://phabricator.wikimedia.org/T395983
[06:19:37] !log Change datadir on pc8 dbmaint eqiad codfw T395983
[06:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:54] (PS1) Marostegui: parsercache.pp: Change datadir [puppet] - https://gerrit.wikimedia.org/r/1153783 (https://phabricator.wikimedia.org/T395983)
[06:24:26] (CR) Marostegui: [C:+2] parsercache.pp: Change datadir [puppet] - https://gerrit.wikimedia.org/r/1153783 (https://phabricator.wikimedia.org/T395983) (owner: Marostegui)
[06:26:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json
[06:27:34] (CR) Reedy: "CU shouldn’t be enabled.." [mediawiki-config] - https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: Lucas Werkmeister)
[06:41:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json
[06:54:29] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:57:29] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T0700). Please do the needful.
[07:00:05] georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:30] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1153265 (https://phabricator.wikimedia.org/T395887) (owner: Dzahn)
[07:03:06] (CR) Arnaudb: [C:+1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1153265 (https://phabricator.wikimedia.org/T395887) (owner: Dzahn)
[07:06:05] Hey folks, we are going to deploy the revertrisk filters for multiple wikis in this patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1152682
[07:13:06] (PS6) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[07:13:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:16:33] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:16:54] (PS7) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[07:17:36] (CR) Gkyziridis: [C:+1] "LGTM! Thnx for working on this."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[07:18:47] Deploying with spiderpig: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1152682
[07:19:02] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
[07:19:24] (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[07:20:10] (Merged) jenkins-bot: ores-extension: enable extension with revertrisk filter for second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[07:20:59] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1152682|ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]]
[07:21:05] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823
[07:22:07] RECOVERY - snapshot of x3 in codfw on backupmon1001 is OK: Last snapshot for x3 at codfw (db2200) taken on 2025-06-05 06:48:49 (334 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:22:45] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet
[07:22:47] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[07:23:25] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1152682|ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:26:04] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[07:26:35] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
[07:26:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
[07:26:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:26:40] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
[07:26:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
[07:27:01] (CR) ArielGlenn: Use GetSecurityLogContext hook for goodpass/badpass logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza)
[07:27:14] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
[07:27:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
[07:30:28] jmm@cumin1003 makevm (PID 275873) is awaiting input
[07:32:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[07:32:30] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
[07:36:25] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[07:36:53] !log jmm@cumin1003 START - Cookbook
sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
[07:38:51] !log gkyziridis@deploy1003 Sync cancelled.
[07:39:32] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) (owner: Dzahn)
[07:40:56] I cancelled the sync because the extension was not enabled for azwiki when I checked via the WikimediaDebug addon. So, I will update the patch to remove azwiki and will deploy to the rest of the wikis.
[07:42:39] (CR) Muehlenhoff: [C:+2] Remove obsolete jobrunner cergen certs [puppet] - https://gerrit.wikimedia.org/r/1149398 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[07:43:02] (PS1) Fabfur: placeholder [puppet] - https://gerrit.wikimedia.org/r/1153932
[07:43:31] FYI, aux-k8s-etcd2003 will go down for a reboot
[07:43:36] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[07:43:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[07:44:04] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[07:45:20] (PS1) Gkyziridis: Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153937
[07:46:03] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:33] (CR) Alexandros Kosiaris: [C:+1] scap: block interactive maintenance scripts on mwmaint [puppet] - https://gerrit.wikimedia.org/r/1152820 (owner: Scott French)
[07:46:39] (CR) Ilias Sarantopoulos: [C:+2] Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153937 (owner: Gkyziridis)
[07:47:26] (Merged) jenkins-bot: Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153937 (owner: Gkyziridis)
[07:49:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
[07:50:13] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[07:50:31] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.70 ms
[07:52:13] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[07:56:14] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[07:57:38] (CR) Gmodena: [C:+2] mw-content-history-reconcile-enrich/mw-content-history-reconcile-enrich-next: +RAM for jobMgr [deployment-charts] - https://gerrit.wikimedia.org/r/1153715 (https://phabricator.wikimedia.org/T395984) (owner: Bking)
[07:59:02] (Merged) jenkins-bot: mw-content-history-reconcile-enrich/mw-content-history-reconcile-enrich-next:
+RAM for jobMgr [deployment-charts] - https://gerrit.wikimedia.org/r/1153715 (https://phabricator.wikimedia.org/T395984) (owner: Bking)
[07:59:06] me and georgekyz ended up cancelling the deployment via spiderpig and will reschedule it for another window
[08:00:23] we cancelled while we were checking with WikimediaDebug and we reverted the patch on mediawiki-config https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1153937. Shall we revert it manually under /srv/mediawiki-staging/wmf-config?
[08:00:50] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[08:00:53] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[08:03:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:03:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T395241)', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
[08:03:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[08:03:56] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
[08:04:42] urbanecm: o/ could you help with the above? (if you're here)
[08:06:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
[08:09:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[08:10:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
[08:10:15] I see the revert is there now, so probably nothing else left to do
[08:11:05] jouncebot:
nowandnext [08:11:05] No deployments scheduled for the next 1 hour(s) and 48 minute(s) [08:11:05] In 1 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1000) [08:12:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T395241)', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json [08:15:39] (03PS1) 10Jelto: gitlab: enable object storage for gitlab-artifacts in production [puppet] - 10https://gerrit.wikimedia.org/r/1153942 (https://phabricator.wikimedia.org/T378922) [08:17:14] (03PS2) 10Majavah: P:openstack: Migrate simple rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1150685 [08:17:14] (03PS2) 10Majavah: P:openstack: pdns: Migrate mysql_root ferm service to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150686 [08:17:14] (03PS2) 10Majavah: P:openstack: codfw1dev: Migrate Cumin ferm term to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150687 [08:17:54] (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks for the detailed commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1153942 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:19:53] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1153942 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:24:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1150685 (owner: 10Majavah) [08:24:23] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:24:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:24:50] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm [08:24:50] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet [08:25:01] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: Migrate simple rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1150685 (owner: 10Majavah) [08:25:37] (03PS1) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) [08:27:21] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage for gitlab-artifacts in production [puppet] - 10https://gerrit.wikimedia.org/r/1153942 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:28:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json [08:28:50] (03CR) 10Effie Mouzeli: [C:03+2] mw-experimental: initial commit (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [08:30:16] (03Merged) 10jenkins-bot: mw-experimental: initial commit (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [08:31:47] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [08:32:28] (03PS1) 10Muehlenhoff: Assign ncredir role to ncredir7003 [puppet] - 10https://gerrit.wikimedia.org/r/1153947 (https://phabricator.wikimedia.org/T394263) [08:32:29] (03PS1) 10Muehlenhoff: Add ncredir7003 to 
conftool [puppet] - 10https://gerrit.wikimedia.org/r/1153948 (https://phabricator.wikimedia.org/T394263) [08:32:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:32:53] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:33:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1150687 (owner: 10Majavah) [08:34:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5766/co" [puppet] - 10https://gerrit.wikimedia.org/r/1150686 (owner: 10Majavah) [08:34:49] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:35:01] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:35:01] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: pdns: Migrate mysql_root ferm service to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150686 (owner: 10Majavah) [08:35:15] (03CR) 10Majavah: [C:03+2] P:openstack: codfw1dev: Migrate Cumin ferm term to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150687 (owner: 10Majavah) [08:35:47] (03CR) 10Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for second batch of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [08:37:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1150686 (owner: 10Majavah) [08:37:53] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki on wikikube-worker2100 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [08:39:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1153947 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:39:46] jmm@cumin1003 drain-node (PID 283929) is awaiting input [08:43:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json [08:43:48] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet [08:44:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [08:45:47] (03PS1) 10Marostegui: db2169: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153951 (https://phabricator.wikimedia.org/T395989) [08:45:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2169 T395989', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json [08:46:01] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [08:46:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance [08:46:33] (03CR) 10Marostegui: [C:03+2] db2169: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153951 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [08:47:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet [08:47:42] (03PS1) 10Tiziano Fogli: ircecho3.py: fix debug output [puppet] - 10https://gerrit.wikimedia.org/r/1153952 (https://phabricator.wikimedia.org/T389937) [08:50:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json [08:50:02] !log jmm@cumin1003 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [08:50:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [08:50:51] (03PS2) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) [08:51:23] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [08:53:09] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [08:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:53:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [08:58:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T395241)', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json [08:58:30] (03PS1) 10Marostegui: mariadb: Give 
parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 [08:58:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:58:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T395241)', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json [09:00:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10886841 (10cmooney) > Check with @cmooney for changes required to hieradata/common/lvs/interfaces.yaml (to add lvs1016 there) and also to hieradata/role/eqiad/lvs/balancer.yam... [09:02:11] (03PS2) 10Marostegui: mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 [09:02:28] jmm@cumin1003 drain-node (PID 286545) is awaiting input [09:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json [09:07:26] (03PS3) 10Marostegui: mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 [09:07:39] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:07:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:08:03] (03CR) 10CI reject: [V:04-1] mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 (owner: 10Marostegui) [09:08:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T395241)', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json [09:12:26] (03PS4) 10Marostegui: mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 [09:12:52] (03PS1) 10Vgutierrez: varnish: Set SameSite=None for wmfuniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1153957 (https://phabricator.wikimedia.org/T395958) [09:13:24] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet [09:14:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153957 (https://phabricator.wikimedia.org/T395958) (owner: 10Vgutierrez) [09:15:17] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - free space: / 2119 MB (3% inode=95%): /tmp 2119 MB (3% inode=95%): /var/tmp 2119 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [09:16:42] (03CR) 10Marostegui: "I think this is what we want: https://puppet-compiler.wmflabs.org/output/1153955/5770/db1152.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1153955 (owner: 10Marostegui) [09:17:11] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet [09:18:16] !log jmm@cumin1003 START 
- Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [09:18:31] (03CR) 10Vgutierrez: [C:03+2] varnish: Set SameSite=None for wmfuniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1153957 (https://phabricator.wikimedia.org/T395958) (owner: 10Vgutierrez) [09:20:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json [09:20:19] (03PS1) 10Muehlenhoff: Extend kafka firewall config for netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153958 (https://phabricator.wikimedia.org/T394263) [09:22:42] (03PS11) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [09:23:14] (03Abandoned) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [09:23:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json [09:23:45] (03PS1) 10Muehlenhoff: Assign netinsights role to netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) [09:24:08] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [09:24:26] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" [09:24:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [09:25:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" [09:26:20] (03CR) 
10Effie Mouzeli: mw-experimental: create new service #6 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [09:26:47] (03PS12) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [09:26:56] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [09:27:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:29:23] (03CR) 10Alexandros Kosiaris: [C:04-1] mw-experimental: create new service #6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [09:29:51] (03CR) 10Hnowlan: [C:03+2] trafficserver: restbaseless reading lists API for ~group1 [puppet] - 10https://gerrit.wikimedia.org/r/1149624 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:30:17] (03PS1) 10Tiziano Fogli: prom7002: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1153962 (https://phabricator.wikimedia.org/T395130) [09:30:35] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [09:30:58] (03PS1) 10Marostegui: db2158: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153963 (https://phabricator.wikimedia.org/T395989) [09:31:00] (03PS13) 10Effie Mouzeli: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [09:31:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158 T395989', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json [09:31:10] T395989: Migrate s6 to 
MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [09:31:53] (03CR) 10Effie Mouzeli: mw-experimental: create new service #6 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [09:31:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:32:21] (03CR) 10Marostegui: [C:03+2] db2158: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1153963 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [09:32:35] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:51] RECOVERY - snapshot of x3 in eqiad on backupmon1001 is OK: Last snapshot for x3 at eqiad (db1216) taken on 2025-06-05 09:03:03 (273 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:35:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json [09:35:33] RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [09:35:53] !log Migrate reading lists API out of restbase for group1 via rest-gateway [09:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:33] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:36:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [09:36:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [09:36:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:36:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json [09:38:19] (03PS2) 10Muehlenhoff: Assign netinsights role to netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) [09:38:33] (03CR) 10Stevemunene: [C:03+2] hdfs: Exclude group 7 and 8 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153560 (https://phabricator.wikimedia.org/T390174) (owner: 10Stevemunene) [09:38:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json [09:39:03] (03PS1) 10Clément Goubert: mw::maintenance: Delete old captchas [puppet] - 10https://gerrit.wikimedia.org/r/1153964 (https://phabricator.wikimedia.org/T388531) [09:39:22] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10886950 (10Jelto) The artifact upload issues have been resolved (T396018). CI job logs and metric look normal. So I'll trigger th... 
[09:40:53] (03PS1) 10Majavah: Fix typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1153965 [09:41:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:41:27] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Enable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153602 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [09:42:50] (03Merged) 10jenkins-bot: mw-cron: Enable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153602 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [09:43:02] (03CR) 10Cathal Mooney: [C:03+1] Extend kafka firewall config for netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153958 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:43:07] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:43:27] RECOVERY - Hadoop DataNode on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [09:43:32] !log Re-enabling CPU/RAM limits on mw-cron - T395436 [09:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:36] T395436: Limit CPU usage for mw-on-k8s cli deployments - https://phabricator.wikimedia.org/T395436 [09:44:07] (03PS2) 10Muehlenhoff: Extend kafka firewall config for netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153958 (https://phabricator.wikimedia.org/T394263) [09:44:16] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:44:42] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [09:45:02] (03CR) 10Cathal Mooney: [C:03+1] Assign netinsights role to netflow7002 [puppet] - 
10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:46:05] isaranto: I think you left the release in a weird state, there's a pending image update for all mw-on-k8s that corresponds to your change this morning [09:46:29] (03CR) 10Vgutierrez: [C:03+1] Assign ncredir role to ncredir7003 [puppet] - 10https://gerrit.wikimedia.org/r/1153947 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:47:21] y'all need to revert and deploy the revert I think [09:47:41] (03CR) 10Gergő Tisza: "[This code block](https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/refs/changes/63/1153363/2/wmf-config/logging.php#311) is se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [09:48:39] jnuche: you around for advice pls? [09:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json [09:50:35] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet [09:50:54] jmm@cumin1003 drain-node (PID 290295) is awaiting input [09:51:12] (03PS3) 10Gergő Tisza: Use GetSecurityLogContext hook for goodpass/badpass logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) [09:51:34] (03CR) 10Gergő Tisza: Use GetSecurityLogContext hook for goodpass/badpass logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [09:51:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json [09:52:05] !log jmm@cumin1003 
START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [09:52:28] (03CR) 10Muehlenhoff: [C:03+2] Extend kafka firewall config for netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153958 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:53:40] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:53:44] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:53:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T395241)', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json [09:54:01] (03CR) 10Btullis: [C:03+1] Extend kafka firewall config for netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153958 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:54:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [09:54:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T395241)', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json [09:57:04] isaranto: I'm backporting the revert. [09:57:22] (03CR) 10Muehlenhoff: [C:03+2] profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [09:57:47] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1153937|Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] [09:58:02] claime: in a meeting. thanks for doing that. Is this what we should do in these cases? 
the previous backport wasn't actually deployed as the deployment was cancelled [09:58:12] I mean: shall we always deploy the revert? [09:58:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [09:58:14] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet [09:58:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [09:58:22] isaranto: yes, because once the image is built, the release files are updated on the deployment server [09:58:44] isaranto: so you should always revert then backport the revert to restore the correct state [09:58:51] ack , thank youu . georgekyz --^ [10:00:03] !log cgoubert@deploy1003 gkyziridis, cgoubert: Backport for [[gerrit:1153937|Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1000) [10:00:25] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [10:01:12] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:01:17] !log cgoubert@deploy1003 gkyziridis, cgoubert: Continuing with sync [10:03:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [10:03:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T395241)', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json [10:03:57] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser [10:04:14] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) [10:04:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2043 es2046 T395241', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json [10:04:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance [10:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json [10:06:55] claime: Thnx for deploying the revert and thnx for sharing this info. We know from now on. 
Thnx again [10:07:00] (03PS1) 10Muehlenhoff: cloudcontrol/codfw1dev: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1153970 [10:07:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json [10:07:22] georgekyz: np, I filed https://phabricator.wikimedia.org/T396106 to make more explicit what needs to be done [10:08:07] claime: very much appreciated! [10:08:13] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [10:08:24] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153937|Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) [10:08:39] (03Abandoned) 10Majavah: systemd: Do not try to validate overrides [puppet] - 10https://gerrit.wikimedia.org/r/1146998 (owner: 10Majavah) [10:08:50] (03CR) 10Majavah: [C:03+2] P:wmcs::instance: Drop unneeded syslog overrides [puppet] - 10https://gerrit.wikimedia.org/r/1142534 (owner: 10Majavah) [10:09:12] (03CR) 10Kamila Součková: [C:03+1] "wheeeeeeeee" [puppet] - 10https://gerrit.wikimedia.org/r/1153964 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [10:09:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json [10:09:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json [10:09:57] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [10:10:53] 
jmm@cumin1003 drain-node (PID 292796) is awaiting input [10:11:02] (03PS1) 10Clément Goubert: shellbox-constraints: Actually bump resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153973 [10:11:51] RECOVERY - Hadoop NodeManager on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:12:03] (03PS2) 10Clément Goubert: mw::maintenance: Delete old captchas [puppet] - 10https://gerrit.wikimedia.org/r/1153964 (https://phabricator.wikimedia.org/T388531) [10:13:10] FIRING: BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:13:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:13:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:13:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:13:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:13:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:13:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:14:33] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:14:46] (03CR) 10Clément Goubert: [C:03+2] "PS2 is only a comment removal, PS1 was +1'd, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1153964 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [10:15:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:15:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:16:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet [10:16:17] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [10:16:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153970 (owner: 10Muehlenhoff) [10:16:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10887039 (10ops-monitoring-bot) Host an-worker1163.eqiad.wmnet rebooted by stevemunene@cumin1002 w... 
[10:16:42] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:00] (03CR) 10Effie Mouzeli: [C:03+2] mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:18:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:18:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:18:23] (03Merged) 10jenkins-bot: mw-experimental: create new service #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:18:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:18:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json [10:21:59] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [10:22:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json [10:22:24] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [10:22:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [10:23:12] (03PS1) 10JMeybohm: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 [10:23:12] (03PS1) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 [10:23:12] (03PS1) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 [10:23:12] (03PS1) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 [10:23:14] (03PS1) 10JMeybohm: admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 [10:23:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json [10:23:55] (03CR) 10Ayounsi: "Lets not have both gnmic running at the same time otherwise we might have funky data in Prometheus. 
Or maybe it will be fine but we need t" [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:23:56] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet [10:24:15] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet [10:24:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10887056 (10ops-monitoring-bot) Host an-worker1164.eqiad.wmnet rebooted by stevemunene@cumin1002 w... [10:24:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json [10:24:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json [10:25:29] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:25:42] (03PS1) 10Muehlenhoff: memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1153981 (https://phabricator.wikimedia.org/T371881) [10:26:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153981 (https://phabricator.wikimedia.org/T371881) (owner: 10Muehlenhoff) [10:26:22] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:26:44] !log cgoubert@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-cron: apply [10:27:29] claime, isaranto: sorry about that, I had just left for lunch and didn't see the ping [10:27:39] jnuche: all good [10:27:43] backporting the revert was indeed the right fix, thanks for doing that [10:27:45] !log Manual run of generatecaptcha on mw-cron with delete - T388531 [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:48] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [10:28:32] claime: jnuche lesson learned! Thank you for the help. We've also shared it with the rest of the team so that folks are aware [10:30:06] !log Ran fixStuckGlobalRename.php for T396054 [10:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:09] T396054: Unblock stuck global rename of Renamed user 5f7280e72219276c1352eb80f69489b0 - https://phabricator.wikimedia.org/T396054 [10:30:25] PROBLEM - Host centrallog2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:31:02] (03PS4) 10Giuseppe Lavagetto: cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) [10:31:07] (03CR) 10Gergő Tisza: [C:03+1] SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [10:31:23] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [10:31:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:31:31] RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [10:31:37] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-worker1164.eqiad.wmnet [10:31:40] (03PS2) 10JMeybohm: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) [10:31:41] (03PS2) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [10:31:43] (03PS2) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [10:31:45] (03PS2) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [10:31:48] (03PS2) 10JMeybohm: admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T396107) [10:32:00] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet [10:32:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10887113 (10ops-monitoring-bot) Host an-worker1165.eqiad.wmnet rebooted by stevemunene@cumin1002 w... 
[10:32:41] PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:33:02] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [10:33:15] PROBLEM - Bird Internet Routing Daemon on centrallog2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:33:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:34:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json [10:34:15] RECOVERY - Bird Internet Routing Daemon on centrallog2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:34:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:34:39] RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog2002 is OK: OK: UP (pid=4107) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:34:42] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
centrallog2002.codfw.wmnet [10:36:12] (03PS3) 10JMeybohm: admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) [10:36:15] (03PS1) 10JMeybohm: admin_ng: Fix dependencies/needs of helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 (https://phabricator.wikimedia.org/T389080) [10:36:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:36:42] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:37:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json [10:38:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:38:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:39:34] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet [10:39:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 
and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json [10:40:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json [10:41:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:41:34] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet [10:42:04] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [10:46:21] (03CR) 10Fabfur: [C:03+1] cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [10:48:28] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: fully set x-provenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [10:49:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T395241)', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json [10:49:21] (03CR) 10Muehlenhoff: "I can disable Puppet on netflow7001 and stop gnmic.service when rolling this out? This should avoid double-reporting. 
And if 7002 is confi" [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:49:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [10:49:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T395241)', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json [10:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:52:00] jmm@cumin1003 drain-node (PID 295932) is awaiting input [10:52:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json [10:54:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:54:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1153962 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:55:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json [10:55:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-105507-root.json [10:55:25] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:56:01] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [software/bitu] - 10https://gerrit.wikimedia.org/r/1153965 (owner: 10Majavah) [10:56:03] (03CR) 10Muehlenhoff: [C:03+2] Fix typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1153965 (owner: 10Majavah) [10:56:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T395241)', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json [10:56:56] (03PS1) 10Vgutierrez: Revert "liberica: Don't deploy ipip-multiqueue-optimizer with katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 [10:57:30] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [10:57:35] (03PS2) 10Vgutierrez: Revert "liberica: Don't deploy ipip-multiqueue-optimizer with katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 (https://phabricator.wikimedia.org/T380450) [10:58:02] (03PS3) 10Vgutierrez: Revert "liberica: Don't deploy ipip-multiqueue-optimizer with katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 (https://phabricator.wikimedia.org/T380450) [10:59:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [11:00:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:02:06] (03CR) 10Fabfur: [C:03+1] Revert "liberica: Don't deploy ipip-multiqueue-optimizer with katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 
(https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [11:02:09] !log restarting Blazegraph on wdqs1023 to address allocator decreasing alert [11:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:03:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [11:03:55] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) [11:04:13] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [11:04:36] ryankemper, inflatador: ^^ (blazegraph restart) [11:08:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:09:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [11:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json [11:10:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json [11:11:15] 
PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:11:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json [11:15:19] jmm@cumin1003 drain-node (PID 298674) is awaiting input [11:19:59] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [11:20:43] (03CR) 10Ayounsi: "Yep that works, you can sync up with Cathal or me (tomorrow) if needed." [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:22:07] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json [11:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json [11:25:37] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [11:25:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [11:26:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [11:26:44] (03PS1) 10Muehlenhoff: Add netflow7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1153993 (https://phabricator.wikimedia.org/T394263) [11:27:07] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json [11:28:06] (03CR) 10Majavah: [C:03+1] "The PCC looks a bit scary with the default value for `srange`, but I think this is doing what it's supposed to." [puppet] - 10https://gerrit.wikimedia.org/r/1153970 (owner: 10Muehlenhoff) [11:30:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:33:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:35:15] !log installing Linux 5.10.237 on Bullseye hosts [11:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:38:35] (03CR) 10Cathal Mooney: [C:03+1] Add netflow7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1153993 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:41:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json [11:42:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance [11:43:11] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2095 MB (3% inode=95%): /tmp 2095 MB (3% inode=95%): /var/tmp 2095 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [11:45:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:47:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [11:47:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json [11:47:49] (03PS1) 10Effie Mouzeli: mw-experimental-mediawiki-image-update: improve script [puppet] - 10https://gerrit.wikimedia.org/r/1153999 [11:48:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:48:46] (03PS2) 10Effie Mouzeli: mw-experimental-mediawiki-image-update: improve script [puppet] - 10https://gerrit.wikimedia.org/r/1153999 [11:48:53] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:48:56] (03CR) 10Tiziano Fogli: [C:03+2] prom7002: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1153962 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [11:49:03] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:52:13] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:52:31] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 1817 MB (3% inode=95%): /tmp 1817 MB (3% inode=95%): /var/tmp 1817 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [11:55:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json [11:56:18] (03PS2) 10Federico Ceratto: pool.py: bugfix: remove diff check [cookbooks] - 10https://gerrit.wikimedia.org/r/1153989 (https://phabricator.wikimedia.org/T383760) [11:56:19] (03CR) 10Federico Ceratto: "As discussed on IRC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1153989 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [11:57:06] (03CR) 10Marostegui: "Isn't this the same as https://gerrit.wikimedia.org/r/1153671 ?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1153989 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1200) [12:07:11] (03CR) 10Muehlenhoff: [C:03+2] Add netflow7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1153993 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:08:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:10:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json [12:11:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:15:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:18:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:20:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2042 es2045 T395241', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-122035-marostegui.json [12:21:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance [12:21:12] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154006 (https://phabricator.wikimedia.org/T391264) [12:21:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:22:58] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154006 (https://phabricator.wikimedia.org/T391264) (owner: 10Jakob) [12:23:15] (03CR) 10Ladsgroup: [C:03+1] mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 (owner: 10Marostegui) [12:24:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:25:21] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154006 (https://phabricator.wikimedia.org/T391264) (owner: 10Jakob) [12:25:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json [12:26:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-122625-root.json [12:26:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json [12:26:38] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154010 [12:27:05] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154006 (https://phabricator.wikimedia.org/T391264) (owner: 10Jakob) [12:27:27] (03CR) 10Marostegui: [C:03+2] mariadb: Give parsercache role to msX [puppet] - 10https://gerrit.wikimedia.org/r/1153955 (owner: 10Marostegui) [12:30:12] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:30:19] (03PS1) 10Marostegui: ms2: Move hosts to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154011 [12:30:27] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:30:54] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:31:10] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [12:31:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:31:59] (03CR) 10Marostegui: [C:03+2] ms2: Move hosts to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154011 (owner: 10Marostegui) [12:32:02] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:32:18] !log 
jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:34:31] (03PS1) 10Marostegui: production-parsercache.sql.erb: Add mainstash database [puppet] - 10https://gerrit.wikimedia.org/r/1154013 [12:35:08] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10887537 (10Jelto) All CI artifacts have been successfully migrated to object storage. The overall disk usage on the GitLab host ha... [12:36:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:36:34] (03CR) 10Marostegui: "This is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1154013 (owner: 10Marostegui) [12:36:52] (03CR) 10Marostegui: [C:03+2] production-parsercache.sql.erb: Add mainstash database [puppet] - 10https://gerrit.wikimedia.org/r/1154013 (owner: 10Marostegui) [12:37:33] (03CR) 10Muehlenhoff: [C:03+2] Assign netinsights role to netflow7002 [puppet] - 10https://gerrit.wikimedia.org/r/1153959 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:38:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:40:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json [12:41:04] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [12:41:06] (03CR) 10Federico Ceratto: "I wasn't aware of it because I was not added as reviewer initially. I'm closing this one and approving the other one." [cookbooks] - 10https://gerrit.wikimedia.org/r/1153989 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [12:41:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T395241)', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json [12:41:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:41:22] (03Abandoned) 10Federico Ceratto: pool.py: bugfix: remove diff check [cookbooks] - 10https://gerrit.wikimedia.org/r/1153989 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [12:41:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json [12:41:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json [12:42:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:43:16] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti 
node ganeti2035.codfw.wmnet [12:46:28] (03CR) 10Ladsgroup: "That was due to ports mixing, I don't think this patch impacts that. I add Alex just in case though." [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [12:47:49] (03CR) 10Ladsgroup: "already cherry-picked in beta" [puppet] - 10https://gerrit.wikimedia.org/r/1153643 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [12:48:03] (03PS3) 10Ladsgroup: beta: Add config for w.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153643 (https://phabricator.wikimedia.org/T396012) [12:48:16] (03CR) 10Ladsgroup: [V:03+2 C:03+2] beta: Add config for w.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153643 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [12:48:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:48:32] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [12:49:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T395241)', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json [12:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2151 T395989', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json [12:51:00] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [12:51:22] (03PS1) 10Marostegui: db2151: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1154019 (https://phabricator.wikimedia.org/T395989) [12:51:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance [12:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:52:09] (03CR) 10Marostegui: [C:03+2] db2151: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154019 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [12:52:25] FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:53:23] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:54:11] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [12:54:13] (03PS5) 10Fabfur: cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [12:54:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [12:54:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:54:57] (03PS6) 10Fabfur: cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [12:55:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json [12:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json [12:56:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json [12:56:49] (03CR) 10Fabfur: cache::haproxy: fully set x-provenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152887 
(https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [12:57:19] (03CR) 10Vgutierrez: [C:03+2] Revert "liberica: Don't deploy ipip-multiqueue-optimizer with katran" [puppet] - 10https://gerrit.wikimedia.org/r/1153985 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1300) [13:00:05] georgekyz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [13:01:34] Hey folks we are going to backport deploy: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1153945 [13:02:35] ok, go ahead [13:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [13:03:23] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [13:03:27] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" [13:03:54] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" [13:04:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff 
saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json [13:04:26] (03Merged) 10jenkins-bot: ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153945 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [13:04:48] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1153945|ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] [13:04:52] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823 [13:07:04] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1153945|ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:08:14] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:09:35] (03PS7) 10Fabfur: cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [13:09:36] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync [13:10:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json [13:10:48] We have tested the changes via the wikimediaDebug plugin; it is syncing now [13:10:57] (03PS1) 10Jelto: gitlab: enable object storage on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154020 (https://phabricator.wikimedia.org/T378922) [13:11:14] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:11:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json [13:11:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json [13:12:25] (03CR) 10Vgutierrez: trafficserver: Add redirect rules for url shortener of beta cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 
10Ladsgroup) [13:12:33] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet [13:12:54] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1154020 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:14:13] (03PS1) 10Muehlenhoff: Remove netflow7001 from Kafka Jumbo ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1154022 (https://phabricator.wikimedia.org/T394263) [13:15:13] (03CR) 10Btullis: [C:03+1] Remove netflow7001 from Kafka Jumbo ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1154022 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:15:20] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:15:58] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 53480448 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:16:39] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153945|ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) [13:16:43] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823 [13:16:58] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 196824 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:17:00] finished [13:17:16] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [13:17:23] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153292 (owner: 10PipelineBot) [13:17:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:11] (03CR) 10Muehlenhoff: [C:03+2] Remove netflow7001 from Kafka Jumbo ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1154022 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:18:20] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:18:47] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151339 (owner: 10PipelineBot) [13:18:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [13:19:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json [13:19:40] (03CR) 10Arnaudb: [C:03+1] gitlab: enable object storage on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154020 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:21:35] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [13:21:36] !log UTC afternoon backport+config window done [13:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:40] georgekyz: thanks for deploying ^^ [13:21:52] RECOVERY - Hadoop 
NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:21:55] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [13:22:16] Lucas_WMDE: Thnx for keeping an eye over my first deployment [13:22:22] 🥳 [13:23:14] 06SRE, 10Ganeti, 06Traffic: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10887649 (10MoritzMuehlenhoff) >>! In T396015#10884209, @ssingh wrote: > @Muehlenhoff: Both of these are decommissioned. Let me know if any other action is required from my end, thanks! Thanks! When th... [13:23:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [13:23:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:23:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet [13:23:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10887650 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `netflow7001.magru.wmnet` - netflow7001.magru.wmnet (**PASS**... 
[13:24:40] (03PS1) 10Vgutierrez: Revert^2 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154027 [13:24:52] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:25:13] (03PS2) 10Vgutierrez: Revert^2 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154027 (https://phabricator.wikimedia.org/T395228) [13:25:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json [13:26:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json [13:26:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json [13:28:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [13:28:13] jmm@cumin1003 drain-node (PID 312019) is awaiting input [13:28:59] (03CR) 10Federico Ceratto: [C:03+1] "I wrote https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1153989 before getting the notification for being added as a reviewer here" [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [13:29:18] (03CR) 10Jelto: [C:03+1] "looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 
(https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [13:31:20] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [13:31:59] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [13:34:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T395241)', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json [13:34:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [13:35:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json [13:36:08] (03PS2) 10Ladsgroup: trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) [13:36:31] (03CR) 10CI reject: [V:04-1] trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:37:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [13:37:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [13:38:53] (03CR) 10Ladsgroup: [C:03+2] sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [13:38:57] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [13:40:03] (03PS3) 
10Ladsgroup: trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) [13:40:15] !log tgr@deploy1003 Locking from deployment [MediaWiki]: T395468 [13:40:38] !log installing net-tools bugfix updates for bookworm [13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json [13:41:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json [13:41:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json [13:43:18] (03CR) 10JMeybohm: [C:03+1] shellbox-constraints: Actually bump resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153973 (owner: 10Clément Goubert) [13:43:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json [13:43:36] (03CR) 10Clément Goubert: [C:03+2] shellbox-constraints: Actually bump resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153973 (owner: 10Clément Goubert) [13:43:50] (03CR) 10Hnowlan: [C:03+1] shellbox-constraints: Actually bump resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153973 (owner: 10Clément Goubert) [13:44:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
ganeti2037.codfw.wmnet [13:44:52] (03CR) 10Vgutierrez: "looking good, please provide a PCC run :)" [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:45:12] (03Merged) 10jenkins-bot: shellbox-constraints: Actually bump resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153973 (owner: 10Clément Goubert) [13:45:48] (03CR) 10Vgutierrez: [C:03+1] haproxy: remove conditionals on wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/1152894 (owner: 10Giuseppe Lavagetto) [13:45:49] (03PS1) 10Tiziano Fogli: Alertmanager: route UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) [13:46:13] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:08] (03Merged) 10jenkins-bot: sre.mysql.pool: Remove diff check functionality [cookbooks] - 10https://gerrit.wikimedia.org/r/1153671 (https://phabricator.wikimedia.org/T383760) (owner: 10Ladsgroup) [13:47:39] (03PS4) 10Ladsgroup: trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) [13:47:44] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [13:47:48] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:48:02] (03PS3) 10Giuseppe Lavagetto: analytics::packages::common: include ua-parser regexes.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) [13:48:15] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [13:48:33] !log upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - T395228 [13:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:36] T395228: 
Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228 [13:48:45] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [13:49:16] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [13:49:37] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [13:49:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:49:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [13:49:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:49:55] 10SRE-SLO: Add a section to the SLO template that explains Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10887711 (10elukey) @Vgutierrez @herron I added more stuff to https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts, especially related to alerting. I... 
[13:50:01] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:50:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [13:50:22] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [13:50:43] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms [13:50:57] (03CR) 10Tiziano Fogli: "This is a proposal to send the alerts to all relevant recipients using the existing Alertmanager receivers, as listed in the commit messag" [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) (owner: 10Tiziano Fogli) [13:51:25] (03PS1) 10Marostegui: mariadb s2 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1154032 (https://phabricator.wikimedia.org/T383795) [13:51:31] !log Migrate s2 codfw to SBR dbmaint T383795 [13:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:33] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [13:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:51:54] (03PS1) 10Vgutierrez: Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 [13:52:08] (03CR) 10CI reject: [V:04-1] Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (owner: 10Vgutierrez) [13:52:23] (03CR) 10Marostegui: "This is noop per se. I will run this live on the hosts to get the config online." 
[puppet] - 10https://gerrit.wikimedia.org/r/1154032 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [13:52:23] (03PS2) 10Vgutierrez: Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) [13:52:25] (03CR) 10Marostegui: [C:03+2] mariadb s2 codfw: Migrate to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1154032 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [13:52:38] (03CR) 10CI reject: [V:04-1] Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:53:20] (03PS3) 10Vgutierrez: Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) [13:53:28] (03CR) 10SBassett: "That's fine. But int-admin and centralnoticeadmin could still do a decent amount of damage and oversighters would still theoretically hav" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [13:54:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:54:54] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887733 (10Jgreen) [13:55:35] (03PS2) 10Tiziano Fogli: Alertmanager: route UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) [13:55:41] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - 
https://phabricator.wikimedia.org/T394788#10887736 (10Jgreen) [13:56:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json [13:56:18] (03PS3) 10Tiziano Fogli: Alertmanager: route UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) [13:56:26] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887740 (10Jgreen) [13:57:02] (03PS4) 10Tiziano Fogli: Alertmanager: route UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) [13:58:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json [13:58:37] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887746 (10Jgreen) [13:59:05] (03CR) 10Fabfur: [C:03+1] "godspeed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1154027 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:59:13] (03PS5) 10Ladsgroup: trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) [13:59:16] (03CR) 10Ladsgroup: [C:03+2] trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:59:18] (03CR) 10Ladsgroup: [V:03+2 C:03+2] trafficserver: Add redirect rules for url shortener of beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1153632 (https://phabricator.wikimedia.org/T396012) (owner: 10Ladsgroup) [13:59:23] (03CR) 10Fabfur: [C:03+1] "good luck!" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:00:42] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887763 (10Jgreen) [14:02:36] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887767 (10Jgreen) [14:04:32] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887771 (10Jgreen) [14:06:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887776 (10Jgreen) [14:07:53] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887780 (10Jgreen) [14:08:05] (03CR) 10SBassett: [C:03+1] Alertmanager: route 
UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) (owner: 10Tiziano Fogli) [14:09:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887788 (10Jgreen) [14:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json [14:11:26] (03CR) 10Muehlenhoff: [C:03+2] cloudcontrol/codfw1dev: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1153970 (owner: 10Muehlenhoff) [14:12:16] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10887800 (10Jgreen) [14:12:45] (03CR) 10Fabfur: [C:03+2] cache::haproxy: fully set x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1152887 (https://phabricator.wikimedia.org/T392217) (owner: 10Giuseppe Lavagetto) [14:12:55] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10887801 (10Jhancock.wm) i also have 6 x 960GB 6 x 1.92 TB [14:13:33] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [14:13:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json [14:14:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10887805 (10Jhancock.wm) weird. that was the only one blinking like that. i checked both enclosures. 
[14:14:35] (03CR) 10Eevans: [C:03+2] restbase: upgrade Cassandra on restbase2012 & restbase1016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914797 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [14:14:46] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154027 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:17:05] !log deploying a PrivateSettings config change [14:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:16] (03Abandoned) 10Eevans: enable authenticated access to Cassandra JMX [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [14:17:20] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" [14:17:29] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" [14:17:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2007 [14:18:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 [14:18:31] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [14:18:47] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [14:18:55] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, let's try it like this and see how we go." 
[puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) (owner: 10Tiziano Fogli) [14:19:22] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran [14:19:52] (03CR) 10Filippo Giunchedi: [C:03+1] ircecho3.py: fix debug output [puppet] - 10https://gerrit.wikimedia.org/r/1153952 (https://phabricator.wikimedia.org/T389937) (owner: 10Tiziano Fogli) [14:19:54] !log tgr@deploy1003 Unlocked for deployment [MediaWiki]: T395468 (duration: 39m 39s) [14:20:17] (03PS1) 10Muehlenhoff: cloudcontrol/eqiad1: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1154037 [14:20:18] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Handle dnsutils/bind9-dnsutils correctly across all OSes [puppet] - 10https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [14:20:32] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors [14:20:33] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors [14:20:51] (03PS4) 10Vgutierrez: Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) [14:22:08] (03CR) 10Filippo Giunchedi: "While this would work, I think silencing and/or ack'ing the alerts would be simpler. 
Note that silences/acks can be created at any time, t" [alerts] - 10https://gerrit.wikimedia.org/r/1148197 (owner: 10Slyngshede) [14:23:03] (03CR) 10Tiziano Fogli: [C:03+2] Alertmanager: route UnknownLogins to multiple receivers [puppet] - 10https://gerrit.wikimedia.org/r/1154031 (https://phabricator.wikimedia.org/T395117) (owner: 10Tiziano Fogli) [14:24:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154037 (owner: 10Muehlenhoff) [14:25:59] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154033 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:28:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json [14:28:51] (03PS1) 10Majavah: P:openstack: pdns: Do not log grants with passwords to console [puppet] - 10https://gerrit.wikimedia.org/r/1154039 [14:29:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance [14:29:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T395241)', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json [14:30:02] (03CR) 10Majavah: [C:03+1] cloudcontrol/eqiad1: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1154037 (owner: 10Muehlenhoff) [14:31:19] (03CR) 10CDanis: [C:03+1] analytics::packages::common: include ua-parser regexes.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) (owner: 10Giuseppe Lavagetto) [14:36:03] (03CR) 10Giuseppe Lavagetto: [C:03+2] analytics::packages::common: include ua-parser regexes.yaml [puppet] - 
10https://gerrit.wikimedia.org/r/1153620 (https://phabricator.wikimedia.org/T394794) (owner: 10Giuseppe Lavagetto) [14:37:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T395241)', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json [14:37:28] (03CR) 10Dzahn: [C:03+2] admin: add user corvus and add them to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/1153700 (https://phabricator.wikimedia.org/T395167) (owner: 10Dzahn) [14:38:16] in case you get stuck at "multiple", yes please just merge both :) [14:38:27] ah, there we go. all good [14:40:54] (03PS2) 10Majavah: P:openstack: pdns: Do not log grants with passwords to console [puppet] - 10https://gerrit.wikimedia.org/r/1154039 [14:40:54] (03PS1) 10Majavah: hieradata: Enable BGP on IPv6 for all cloudlb2* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154040 (https://phabricator.wikimedia.org/T379282) [14:42:06] (03CR) 10Andrew Bogott: [C:03+1] hieradata: Enable BGP on IPv6 for all cloudlb2* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154040 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:43:06] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1154040 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:43:51] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Enable BGP on IPv6 for all cloudlb2* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154040 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:50:16] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet [14:52:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and 
previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json [14:53:19] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:18] (03PS1) 10Muehlenhoff: Record LDAP access for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1154041 [14:57:19] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:57:54] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1154041 (owner: 10Muehlenhoff) [14:58:30] (03CR) 10FNegri: [C:03+1] P:openstack: pdns: Do not log grants with passwords to console [puppet] - 10https://gerrit.wikimedia.org/r/1154039 (owner: 10Majavah) [14:58:42] (03CR) 10Majavah: [C:03+2] P:openstack: pdns: Do not log grants with passwords to console [puppet] - 10https://gerrit.wikimedia.org/r/1154039 (owner: 10Majavah) [14:59:07] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet [14:59:14] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet [14:59:22] (03PS1) 10Tiziano Fogli: prometheus7002: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1154042 (https://phabricator.wikimedia.org/T395130) [15:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1500). 
[15:01:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129 (10dena) 03NEW [15:01:34] (03CR) 10Tiziano Fogli: [C:03+2] ircecho3.py: fix debug output [puppet] - 10https://gerrit.wikimedia.org/r/1153952 (https://phabricator.wikimedia.org/T389937) (owner: 10Tiziano Fogli) [15:02:19] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:03:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154042 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [15:05:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10888054 (10WMDE-leszek) I approve this request on WMDE's end. Account already in `nde` and `wmde` gr... 
[15:05:19] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:05:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10888056 (10WMDE-leszek) [15:06:14] (03CR) 10Tiziano Fogli: [C:03+2] prometheus7002: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1154042 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:07:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json [15:08:03] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet [15:10:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:11:22] (03PS1) 10Tiziano Fogli: prometheus7002: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) [15:13:01] 
(03PS1) 10Fabfur: cache::haproxy: fix set-header syntax [puppet] - 10https://gerrit.wikimedia.org/r/1154047 (https://phabricator.wikimedia.org/T392217) [15:13:25] (03CR) 10CDanis: [C:03+1] cache::haproxy: fix set-header syntax [puppet] - 10https://gerrit.wikimedia.org/r/1154047 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [15:14:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:23] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154047 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [15:18:52] (03CR) 10Fabfur: [C:03+2] cache::haproxy: fix set-header syntax [puppet] - 10https://gerrit.wikimedia.org/r/1154047 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [15:19:03] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: fix set-header syntax [puppet] - 10https://gerrit.wikimedia.org/r/1154047 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [15:22:21] (03PS1) 10Majavah: hieradata: Add missing cloud-realm public VIP v6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) [15:22:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2236 (T395241)', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json [15:23:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance [15:23:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T395241)', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json [15:23:55] (03PS2) 10Scott French: scap: block interactive maintenance scripts on mwmaint [puppet] - 10https://gerrit.wikimedia.org/r/1152820 (https://phabricator.wikimedia.org/T341553) [15:23:56] (03CR) 10Majavah: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5774/console" [puppet] - 10https://gerrit.wikimedia.org/r/1154040 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:30:34] (03CR) 10Scott French: "Thank you both for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152820 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:30:36] (03CR) 10Scott French: [C:03+2] scap: block interactive maintenance scripts on mwmaint [puppet] - 10https://gerrit.wikimedia.org/r/1152820 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:30:57] (03PS2) 10Majavah: hieradata: Add missing cloud-realm public VIP v6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) [15:31:32] (03CR) 10CI reject: [V:04-1] hieradata: Add missing cloud-realm public VIP v6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:31:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T395241)', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json [15:32:05] (03PS3) 10Majavah: hieradata: Add missing cloud-realm public VIP v6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) [15:32:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:34:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:35:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:39:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I'll abandon https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153126" [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [15:39:04] (03CR) 10Cathal Mooney: [C:03+1] "Yep looks good. Took me a minute to grok why we didn't need the nftables rules for the v6 ranges too, but I see the reason in v4 is to no" [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:39:18] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Add missing cloud-realm public VIP v6 ranges [puppet] - 10https://gerrit.wikimedia.org/r/1154050 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:46:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json [15:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:54:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:55:17] !log aokoth@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration [15:56:53] (03PS1) 10Bernard Wang: Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 [15:57:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:57:45] (03PS2) 10Bernard Wang: Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) [15:59:44] (03PS1) 10Clare Ming: xLab: Deploying xLab v0.6.5 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154058 (https://phabricator.wikimedia.org/T395922) [16:00:04] jhathaway and moritzm: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1600). Please do the needful. [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" [16:01:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" [16:01:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json [16:02:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2244 [16:02:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 [16:03:20] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10888267 (10Dzahn) Hey @Corvus you have access now to hosts in production. Your SSH setup will be very similar to what you already... 
[16:03:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:03:29] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10888271 (10Dzahn) 05In progress→03Resolved a:03Dzahn [16:06:48] jhancock@cumin2002 provision (PID 1756132) is awaiting input [16:07:43] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:08:11] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:09:18] (03PS3) 10Effie Mouzeli: mw-experimental-mediawiki-image-update: improve script [puppet] - 10https://gerrit.wikimedia.org/r/1153999 [16:09:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:09:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [16:09:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10888290 (10Stevemunene) [16:10:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10888294 (10Stevemunene) 05Open→03Resolved The hosts are 
back in the analytics cluster {F61670396} [16:11:33] @thcipriani we have a developing community conversation and we need to deploy a config change before it spirals further. Can I do an out of bounds deployment now if the window is free? [16:11:34] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [16:11:54] jouncebot: nowandnext [16:11:54] For the next 0 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1600) [16:11:54] In 0 hour(s) and 48 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1700) [16:11:54] In 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1700) [16:11:57] Change is scheduled for 4hrs from now (https://gerrit.wikimedia.org/r/c/1153750/) but ideally should go out asap [16:12:42] Jdlrobson: looks like you've got time, good by me. [16:12:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:12:51] thanks thcipriani ! 
okay doing this now [16:13:02] (03CR) 10Dduvall: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152145 (owner: 10Dzahn) [16:14:05] (03PS2) 10Jdlrobson: Revert "Deploy survey to en at twenty percent" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153750 (owner: 10Jdrewniak) [16:14:27] herron and urandom FYI as on-call SRE folks: doing an out of window backport (also dduvall as train conductor) [16:14:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153750 (owner: 10Jdrewniak) [16:14:48] thcipriani: ack ok [16:15:41] (03Merged) 10jenkins-bot: Revert "Deploy survey to en at twenty percent" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153750 (owner: 10Jdrewniak) [16:16:07] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1153750|Revert "Deploy survey to en at twenty percent"]] [16:17:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T395241)', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json [16:18:17] !log jdlrobson@deploy1003 jdlrobson, jdrewniak: Backport for [[gerrit:1153750|Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:20:17] ok patch looks good so syncing now [16:20:31] !log jdlrobson@deploy1003 jdlrobson, jdrewniak: Continuing with sync [16:20:48] * thcipriani continues stalking in spiderpig :) [16:21:11] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10888315 (10Jhancock.wm) [16:21:43] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:24:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:27:31] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153750|Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) [16:28:33] okay done! Thanks thcipriani patting spiderpig on the head and giving him an apple as a reward! 
[16:29:07] heh, glad you were able to get it out, thanks for the spiderpig maintenance :) [16:29:44] (03PS1) 10Effie Mouzeli: x-wikimedia-debug-routing: add mw-experimental hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154069 (https://phabricator.wikimedia.org/T276994) [16:30:28] (03PS1) 10Effie Mouzeli: debug.json: add mw-experimental hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) [16:45:29] (03PS1) 10Aqu: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) [16:50:08] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots [16:50:27] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 [16:50:29] (03PS1) 10Andrew Bogott: Add openstack octavia logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1154072 (https://phabricator.wikimedia.org/T395864) [16:50:48] (03PS2) 10Andrew Bogott: Add openstack octavia logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1154072 (https://phabricator.wikimedia.org/T395864) [16:51:27] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) [16:52:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154072 (https://phabricator.wikimedia.org/T395864) (owner: 10Andrew Bogott) [16:52:37] FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:51] FIRING: GoRoutinesTooHigh: gNMIc 
running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [16:53:55] (03CR) 10Andrew Bogott: [C:03+2] Add openstack octavia logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/1154072 (https://phabricator.wikimedia.org/T395864) (owner: 10Andrew Bogott) [16:54:33] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet [16:54:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet [16:56:43] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:35] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:50] (03PS2) 10Aqu: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) [16:58:10] FIRING: [2x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:58:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:59:35] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:43] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1700) [17:00:05] swfrench-wmf and jasmine_: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1700). [17:03:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:03:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:03:51] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [17:07:32] (03PS1) 10Vgutierrez: Revert^3 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154081 [17:08:19] (03CR) 10Vgutierrez: [C:03+2] Revert^3 "hiera: Use katran in lvs1013" [puppet] - 
10https://gerrit.wikimedia.org/r/1154081 (owner: 10Vgutierrez) [17:15:39] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet [17:15:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet [17:16:21] (03PS1) 10Vgutierrez: Revert^3 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154083 [17:16:47] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-06-02-122807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154084 [17:17:52] (03CR) 10Vgutierrez: [C:03+2] Revert^3 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154083 (owner: 10Vgutierrez) [17:20:11] (03PS1) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [17:20:35] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:20:39] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-06-02-122807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154084 (owner: 10BryanDavis) [17:21:04] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [17:21:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [17:22:13] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-06-02-122807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154084 (owner: 10BryanDavis) [17:23:11] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2115 MB (3% inode=95%): /tmp 2115 MB (3% inode=95%): /var/tmp 2115 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [17:24:26] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:24:41] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:24:53] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:25:17] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:26:54] (03PS2) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [17:27:58] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:28:15] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:45:02] (03PS2) 10Volans: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) [17:48:34] (03CR) 10Volans: "Ready for final testing (see task)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans) [17:48:35] (03PS1) 10Clare Ming: xLab: Deploy v0.6.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154090 (https://phabricator.wikimedia.org/T395922) [17:48:55] (03Abandoned) 10Clare Ming: xLab: Deploying xLab v0.6.5 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154058 (https://phabricator.wikimedia.org/T395922) (owner: 10Clare Ming) [17:50:11] (03PS2) 10Clare Ming: xLab: Deploy v0.6.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154090 (https://phabricator.wikimedia.org/T395922) [17:55:54] (03PS1) 10Clare Ming: xLab: Deploy v0.6.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154093 (https://phabricator.wikimedia.org/T395922) [17:58:20] (03CR) 10Clare Ming: [C:03+2] 
xLab: Deploy v0.6.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154090 (https://phabricator.wikimedia.org/T395922) (owner: 10Clare Ming) [17:59:38] (03Merged) 10jenkins-bot: xLab: Deploy v0.6.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154090 (https://phabricator.wikimedia.org/T395922) (owner: 10Clare Ming) [18:00:04] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T1800). [18:04:21] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:04:43] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10888642 (10Jclark-ctr) [18:05:05] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:10:07] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:11:24] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154097 (https://phabricator.wikimedia.org/T392174) [18:11:25] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154097 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [18:12:24] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154097 (https://phabricator.wikimedia.org/T392174) (owner: 10TrainBranchBot) [18:14:36] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.6.6 release to production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1154093 (https://phabricator.wikimedia.org/T395922) (owner: 10Clare Ming) [18:15:07] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:16:11] (03Merged) 10jenkins-bot: xLab: Deploy v0.6.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154093 (https://phabricator.wikimedia.org/T395922) (owner: 10Clare Ming) [18:17:17] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:17:46] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:17:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) [18:17:54] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:18:18] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:19:13] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:41] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [18:20:49] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [18:21:17] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for 
host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:21:45] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs T392174 [18:21:51] T392174: 1.45.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T392174 [18:25:01] (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [18:29:09] vriley@cumin1002 provision (PID 1521340) is awaiting input [18:30:11] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:32:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:33:10] (03PS1) 10Jdlrobson: Fix back compat for data-chart [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154098 (https://phabricator.wikimedia.org/T395462) [18:33:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154098 (https://phabricator.wikimedia.org/T395462) (owner: 10Jdlrobson) [18:33:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10888699 (10Jhancock.wm) [18:34:14] (03PS1) 10Jdlrobson: Enable anonymous previews on beta cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154099 [18:40:12] (03CR) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [18:43:57] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:44:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:47:12] (03CR) 10AOkoth: [C:03+2] trafficserver: point os-reports to k8s record [puppet] - 10https://gerrit.wikimedia.org/r/1152305 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:47:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:48:08] vriley@cumin1002 provision (PID 1544797) is awaiting input [18:48:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:49:29] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [18:49:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10888726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... 
[18:49:46] (03CR) 10Xcollazo: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [18:52:02] !log Disabled the SDS 2.4.11 Synthetic A/A Test in xLab [18:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:26] (03CR) 10Dzahn: [C:03+1] gitlab: enable object storage on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154020 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [19:01:00] (03PS2) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1137463 [19:01:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:01:25] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:01:38] (03CR) 10Dzahn: "rebased into nothing :) already done meanwhile in another change - looks like it can simply be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [19:02:25] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:23] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:06:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 7.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:06:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:07:31] (03CR) 10BCornwall: "Oops! 
Forgot there was already a CR open for this. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [19:07:37] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [19:12:07] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [19:12:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [19:12:42] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:13:52] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:13:57] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [19:14:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [19:14:25] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:15:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:21:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:23:11] PROBLEM - Disk space on an-worker1131 is CRITICAL: DISK CRITICAL - free space: / 2070 MB (3% inode=95%): /tmp 2070 MB (3% inode=95%): /var/tmp 2070 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1131&var-datasource=eqiad+prometheus/ops [19:24:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:25:38] 10SRE-SLO: Add a section to the SLO template that explains Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10888868 (10herron) >>! In T395920#10887711, @elukey wrote: > @Vgutierrez @herron I added more stuff to https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and... [19:34:55] (03PS3) 10Aleksandar Mastilovic: Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) [19:41:26] (03PS4) 10Aleksandar Mastilovic: Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) [19:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:59:33] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10888894 (10Volans) @bking @RKemper I'm ready with the final test for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1150728 I... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T2000). 
[20:00:05] jan_drewniak and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] I can deploy [20:00:25] (03CR) 10Btullis: [C:03+1] "OK, looks good, but I think that we need the `/config` directory to exist in airflow-dags first." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) (owner: 10Aleksandar Mastilovic) [20:00:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154099 (owner: 10Jdlrobson) [20:01:19] @dduvall just checking you are done with train etc? [20:01:39] Jdlrobson: yes! all done [20:01:40] (and @dancy ^) [20:01:43] cool! Starting now then [20:01:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154099 (owner: 10Jdlrobson) [20:02:37] (03Merged) 10jenkins-bot: Enable anonymous previews on beta cluster for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154099 (owner: 10Jdlrobson) [20:03:11] PROBLEM - Disk space on an-worker1131 is CRITICAL: DISK CRITICAL - free space: / 2058 MB (3% inode=95%): /tmp 2058 MB (3% inode=95%): /var/tmp 2058 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1131&var-datasource=eqiad+prometheus/ops [20:05:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154098 (https://phabricator.wikimedia.org/T395462) (owner: 10Jdlrobson) [20:05:38] (03CR) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs (031
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [20:09:45] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [20:09:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10888907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [20:12:44] (03CR) 10BryanDavis: "You are correct @gtisza@wikimedia.org. I was looking at the changes in this patch and not the larger context of the config. Sorry for the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [20:13:52] (03Merged) 10jenkins-bot: Fix back compat for data-chart [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154098 (https://phabricator.wikimedia.org/T395462) (owner: 10Jdlrobson) [20:14:06] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1154098|Fix back compat for data-chart (T395462)]] [20:14:10] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [20:14:53] (03PS1) 10Krinkle: wmf-config: Fix filename typo in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154108 [20:15:41] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:16:05] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1154098|Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:17:08] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [20:17:27] PROBLEM - Host logging-hd1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:17] RECOVERY - Host logging-hd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:24:12] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154098|Fix back compat for data-chart (T395462)]] (duration: 10m 05s) [20:24:15] T395462: Charts not being output correctly in Parsoid - https://phabricator.wikimedia.org/T395462 [20:25:41] ok everything looking healthy so releasing spiderpig again :) [20:30:41] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:44:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:46:11] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10889000 (10Jdforrester-WMF) [20:47:17] (03PS3) 10Aqu: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) [20:47:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:48:47] (03CR) 10CI reject: [V:04-1] Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) 
(owner: 10Aqu) [20:49:55] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:50:06] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:51:55] (03PS1) 10Clare Ming: xLab: Deploy v0.6.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154114 [20:52:37] FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:38] (03PS1) 10Clare Ming: xLab: Deploy v0.6.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154115 [20:53:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 253351936 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:54:26] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.6.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154114 (owner: 10Clare Ming) [20:55:40] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.6.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154115 (owner: 10Clare Ming) [20:55:51] (03Merged) 10jenkins-bot: xLab: Deploy v0.6.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154114 (owner: 10Clare Ming) [20:55:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23720 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:56:29] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[20:56:39] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:56:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:57:06] (03Merged) 10jenkins-bot: xLab: Deploy v0.6.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154115 (owner: 10Clare Ming) [20:57:13] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [20:59:32] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250605T2100) [21:00:15] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [21:00:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:00:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[21:07:37] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [21:14:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:16:32] (03PS1) 10Jforrester: captureSpeedtest: Drop PHP 7 check, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154121 [21:17:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:17:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 400861464 and 23 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:19:57] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 27872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:24:02] (03CR) 10Máté Szabó: [C:03+2] captureSpeedtest: Drop PHP 7 check, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154121 (owner: 10Jforrester) [21:24:03] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[21:24:12] (03CR) 10Máté Szabó: captureSpeedtest: Drop PHP 7 check, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154121 (owner: 10Jforrester) [21:25:05] (03CR) 10Máté Szabó: [C:03+1] captureSpeedtest: Drop PHP 7 check, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154121 (owner: 10Jforrester) [21:29:39] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:42:27] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:57:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:58:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:58:25] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [22:02:43] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [22:03:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [22:03:29] (03PS1) 10Arlolra: Disable VipsScaler in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) [22:04:10] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [22:05:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[22:09:05] (03PS1) 10Thcipriani: Scap: (beta) Use php8.1 for mwscript/php-fpm restarts [puppet] - 10https://gerrit.wikimedia.org/r/1154131 (https://phabricator.wikimedia.org/T396158) [22:09:28] (03CR) 10Thcipriani: [C:04-1] Scap: (beta) Use php8.1 for mwscript/php-fpm restarts [puppet] - 10https://gerrit.wikimedia.org/r/1154131 (https://phabricator.wikimedia.org/T396158) (owner: 10Thcipriani) [22:16:19] (03PS1) 10BryanDavis: shellbox: Bump to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) [22:23:34] (03Abandoned) 10Jdlrobson: Enable dark mode on Wikidata for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152855 (https://phabricator.wikimedia.org/T395919) (owner: 10Jdlrobson) [22:56:53] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 915 MB (1% inode=98%): /tmp 915 MB (1% inode=98%): /var/tmp 915 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [23:04:56] (03PS1) 10Krinkle: multiversion: Document how it all works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) [23:17:39] (03Abandoned) 10Thcipriani: Scap: (beta) Use php8.1 for mwscript/php-fpm restarts [puppet] - 10https://gerrit.wikimedia.org/r/1154131 (https://phabricator.wikimedia.org/T396158) (owner: 10Thcipriani) [23:30:54] (03PS1) 10Krinkle: multivesion: Remove unused newFromDBName() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 [23:30:58] (03PS1) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [23:31:22] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing 
file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10889545 (10Ladsgroup) I understand the need to have multi write backends an... [23:35:36] (03PS2) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [23:35:38] (03CR) 10BryanDavis: [C:04-1] "Waiting to find out if Iedf330b65785a3984fd41b8bb68cc61d86a23004 should happen first." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: 10BryanDavis) [23:36:27] (03CR) 10CI reject: [V:04-1] multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [23:36:29] (03CR) 10Krinkle: [C:04-1] multiversion: Re-use prod for beta setSiteInfoForWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [23:36:51] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#10889586 (10bd808) [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154142 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154142 (owner: 10TrainBranchBot) [23:51:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154142 (owner: 10TrainBranchBot) [23:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:54:51] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process