[00:02:00] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P45141 and previous config saved to /var/cache/conftool/dbconfig/20230307-000512-marostegui.json [00:12:52] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:14:00] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [00:20:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P45142 and previous config saved to /var/cache/conftool/dbconfig/20230307-002019-marostegui.json [00:23:08] !log people* - determined which users did not have a public_html dir in codfw but did in eqiad. created that dir, rsynced via push from people1003 to people2002 for the 7 affected users. re-enabled temp disabled puppet to restore live-hacked rsync config. T330091 [00:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:14] T330091: Switchover People and Planet services to codfw - https://phabricator.wikimedia.org/T330091 [00:31:11] (03Abandoned) 10Sergio Gimeno: Enable the topic match mode in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: 10Sergio Gimeno) [00:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T329203)', diff saved to https://phabricator.wikimedia.org/P45143 and previous config saved to /var/cache/conftool/dbconfig/20230307-003525-marostegui.json [00:35:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [00:35:34] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [00:35:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [00:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T329203)', diff saved to https://phabricator.wikimedia.org/P45144 and previous config saved to /var/cache/conftool/dbconfig/20230307-003547-marostegui.json [00:43:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) I created a self dispatch it was rejected twice the reason being that i didn't specify the type or disk (480 or 8T) or I did, so I had to call Dell to clarify this. [00:46:33] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (10Papaul) a:03Jhancock.wm [00:50:51] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (10Papaul) @Jhancock.wm I think you can now do this from start to finish no need to assign the task back to me. 
So after you remove the disk and remove the server form the rack you can go... [00:52:47] 10SRE, 10ops-codfw, 10Wikimedia-Incident: 2022-12-15 codfw worker exhaustion - https://phabricator.wikimedia.org/T328353 (10Papaul) @andrea.denisse can we resolve this task ? [00:56:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T329203)', diff saved to https://phabricator.wikimedia.org/P45145 and previous config saved to /var/cache/conftool/dbconfig/20230307-005611-marostegui.json [00:56:19] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [01:07:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10phaultfinder) [01:11:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P45146 and previous config saved to /var/cache/conftool/dbconfig/20230307-011117-marostegui.json [01:26:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P45147 and previous config saved to /var/cache/conftool/dbconfig/20230307-012624-marostegui.json [01:41:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T329203)', diff saved to https://phabricator.wikimedia.org/P45148 and previous config saved to /var/cache/conftool/dbconfig/20230307-014130-marostegui.json [01:41:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [01:41:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [01:41:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [01:41:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T329203)', diff saved to https://phabricator.wikimedia.org/P45149 and previous config saved to /var/cache/conftool/dbconfig/20230307-014152-marostegui.json [02:03:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T329203)', diff saved to https://phabricator.wikimedia.org/P45150 and previous config saved to /var/cache/conftool/dbconfig/20230307-020330-marostegui.json [02:03:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P45151 and previous config saved to /var/cache/conftool/dbconfig/20230307-021837-marostegui.json [02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:44] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:33:32] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P45152 and previous config saved to /var/cache/conftool/dbconfig/20230307-023344-marostegui.json [02:48:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T329203)', diff saved to https://phabricator.wikimedia.org/P45153 and previous config saved to /var/cache/conftool/dbconfig/20230307-024850-marostegui.json [02:48:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [02:48:58] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [02:49:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [02:49:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T329203)', diff saved to https://phabricator.wikimedia.org/P45154 and previous config saved to /var/cache/conftool/dbconfig/20230307-024912-marostegui.json [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0300) [03:05:58] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/894579 (https://phabricator.wikimedia.org/T330204) [03:08:08] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/894579 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [03:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T329203)', diff saved to https://phabricator.wikimedia.org/P45155 and previous config saved to /var/cache/conftool/dbconfig/20230307-031000-marostegui.json [03:10:09] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [03:25:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P45156 and previous config saved to /var/cache/conftool/dbconfig/20230307-032506-marostegui.json [03:25:15] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/894579 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [03:38:30] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:40:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P45157 and previous config saved to /var/cache/conftool/dbconfig/20230307-034013-marostegui.json [03:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T329203)', diff saved to https://phabricator.wikimedia.org/P45158 and previous config saved to /var/cache/conftool/dbconfig/20230307-035520-marostegui.json [03:55:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [03:55:27] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [03:55:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [03:55:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T329203)', diff saved to https://phabricator.wikimedia.org/P45159 and previous config saved to /var/cache/conftool/dbconfig/20230307-035541-marostegui.json [04:00:04] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0400) [04:00:42] (03PS1) 10Kimberly Sarabia: Add header at top of main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894765 (https://phabricator.wikimedia.org/T325362) [04:06:28] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:56] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:37:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [04:56:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T329203)', diff saved to https://phabricator.wikimedia.org/P45160 and previous config saved to /var/cache/conftool/dbconfig/20230307-045607-marostegui.json [04:56:15] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [05:11:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P45161 and previous config saved to /var/cache/conftool/dbconfig/20230307-051113-marostegui.json [05:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P45162 and previous config saved to /var/cache/conftool/dbconfig/20230307-052620-marostegui.json [05:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T329203)', diff saved to https://phabricator.wikimedia.org/P45163 and previous config saved to /var/cache/conftool/dbconfig/20230307-054127-marostegui.json [05:41:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 
on db2164.codfw.wmnet with reason: Maintenance [05:41:35] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [05:41:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [05:41:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [05:41:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [05:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T329203)', diff saved to https://phabricator.wikimedia.org/P45164 and previous config saved to /var/cache/conftool/dbconfig/20230307-054153-marostegui.json [05:54:28] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:02:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T329203)', diff saved to https://phabricator.wikimedia.org/P45165 and previous config saved to /var/cache/conftool/dbconfig/20230307-060210-marostegui.json [06:02:18] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:07:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [06:13:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:13:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:13:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:13:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P45166 and previous config saved to /var/cache/conftool/dbconfig/20230307-061717-marostegui.json [06:19:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:20:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:25:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:25:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:26:54] (03PS3) 10Giuseppe Lavagetto: trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541 [06:27:47] (03PS1) 10Marostegui: mariadb: Remove db2095 [puppet] - 10https://gerrit.wikimedia.org/r/894771 (https://phabricator.wikimedia.org/T330975) [06:28:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2095.codfw.wmnet [06:29:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2095 [puppet] - 10https://gerrit.wikimedia.org/r/894771 (https://phabricator.wikimedia.org/T330975) (owner: 10Marostegui) [06:32:24] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P45167 and previous config saved to /var/cache/conftool/dbconfig/20230307-063223-marostegui.json [06:32:43] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:34:45] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2095.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:36:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2095.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:36:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:36:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2095.codfw.wmnet [06:37:44] 10ops-codfw, 10DBA, 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Marostegui) a:05Marostegui→03None [06:37:47] 10ops-codfw, 10DBA, 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Marostegui) This is ready for DC-Ops [06:37:58] 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Marostegui) [06:38:17] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (10Marostegui) [06:39:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 37 hosts with reason: Schema change on s1 eqiad [06:40:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 37 hosts with reason: Schema change on s1 eqiad [06:40:39] (03CR) 10Giuseppe Lavagetto: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [06:40:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: remove support for "php7.4" option in x-w-d [puppet] - 10https://gerrit.wikimedia.org/r/867541 (owner: 10Giuseppe Lavagetto) [06:42:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 34 hosts with reason: Schema change on s4 eqiad [06:42:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Schema change on s4 eqiad [06:43:04] !log dbmaint eqiad s1 T328817 [06:43:05] !log dbmaint eqiad s4 T328817 [06:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:09] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T329203)', diff saved to https://phabricator.wikimedia.org/P45168 and previous config saved to /var/cache/conftool/dbconfig/20230307-064730-marostegui.json [06:47:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [06:47:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - 
https://phabricator.wikimedia.org/T329203 [06:47:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [06:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T329203)', diff saved to https://phabricator.wikimedia.org/P45169 and previous config saved to /var/cache/conftool/dbconfig/20230307-064752-marostegui.json [06:49:35] (03CR) 10Elukey: [C: 03+2] httpbb: add tests for nsfw model on liftwing [puppet] - 10https://gerrit.wikimedia.org/r/894714 (owner: 10Ilias Sarantopoulos) [06:53:50] !log dbmaint eqiad s4 T329203 [06:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:56] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:54:08] !log dbmaint eqiad s1 T329203 [06:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0700) [07:00:05] kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0700). [07:09:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329203)', diff saved to https://phabricator.wikimedia.org/P45170 and previous config saved to /var/cache/conftool/dbconfig/20230307-070923-marostegui.json [07:09:31] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:18:13] (03PS3) 10Slyngshede: P:IDM Minor fixes and restructure. [puppet] - 10https://gerrit.wikimedia.org/r/894527 [07:18:22] (03CR) 10Slyngshede: P:IDM Minor fixes and restructure. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [07:23:09] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10Aklapper) a:05NatHillard→03None [07:23:56] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10Aklapper) 05Open→03Stalled Hi and welcome! Please see https://phabricator.wikimedia.org/tag/ldap-access-requests/ for data to include. Once provided, please change the task status to `open` vi... 
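The marostegui entries above repeat a per-replica maintenance loop: downtime the host, depool it with dbctl, apply the schema change (T329203 / T328817), then repool it in stages. A minimal sketch of that cycle as run from a cumin host follows; the host name, percentages and cookbook flags are illustrative (flag spellings from memory) and may not match the exact invocations behind these log lines.

    # Downtime the replica in monitoring for the maintenance window
    # (cookbook option names are from memory and may differ).
    sudo cookbook sre.hosts.downtime --hours 12 -r "Maintenance" db2166.codfw.wmnet

    # Depool the instance and commit; dbctl saves the diff as a Phabricator
    # paste, which is what the !log lines above link to.
    sudo dbctl instance db2166 depool
    sudo dbctl config commit -m "Depooling db2166 (T329203)"

    # ... apply the schema change on the depooled replica here ...

    # Repool gradually; the several "Repooling after maintenance" commits above
    # correspond to stepped weight increases like these (real runs wait for
    # replication and traffic to settle between steps).
    for pct in 25 50 75 100; do
        sudo dbctl instance db2166 pool -p "$pct"
        sudo dbctl config commit -m "Repooling after maintenance db2166 (T329203)"
    done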
[07:24:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P45171 and previous config saved to /var/cache/conftool/dbconfig/20230307-072429-marostegui.json [07:24:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101 (s7,s8) T331381', diff saved to https://phabricator.wikimedia.org/P45172 and previous config saved to /var/cache/conftool/dbconfig/20230307-072454-root.json [07:25:01] T331381: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 [07:26:43] (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/894960 (https://phabricator.wikimedia.org/T331381) [07:27:20] (03CR) 10Marostegui: [C: 03+2] db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/894960 (https://phabricator.wikimedia.org/T331381) (owner: 10Marostegui) [07:29:44] (03PS1) 10Vgutierrez: hiera: Enable HAProxy systemd hardening in cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/894961 (https://phabricator.wikimedia.org/T323944) [07:31:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Row A switch maintenance T329073 [07:31:11] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [07:31:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Row A switch maintenance T329073 [07:31:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1115.eqiad.wmnet with reason: Row A switch maintenance T329073 [07:31:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1115.eqiad.wmnet with reason: Row A switch maintenance T329073 [07:31:48] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable HAProxy systemd hardening in cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/894961 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [07:32:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db[1151-1153].eqiad.wmnet with reason: Row A switch maintenance T329073 [07:32:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[1151-1153].eqiad.wmnet with reason: Row A switch maintenance T329073 [07:33:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db[2142-2144].codfw.wmnet with reason: Row A switch maintenance T329073 [07:33:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2142-2144].codfw.wmnet with reason: Row A switch maintenance T329073 [07:34:00] !log enable haproxy systemd service unit hardening in cp4044 - T323944 [07:34:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Row A switch maintenance T329073 [07:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:05] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 [07:34:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Row A switch maintenance T329073 [07:38:48] (03PS1) 10Marostegui: mariadb: Move db1101 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/894962 (https://phabricator.wikimedia.org/T331382) [07:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', 
diff saved to https://phabricator.wikimedia.org/P45174 and previous config saved to /var/cache/conftool/dbconfig/20230307-073936-marostegui.json [07:39:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1101 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/894962 (https://phabricator.wikimedia.org/T331382) (owner: 10Marostegui) [07:43:23] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:44:05] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:44:51] ^ expected [07:45:03] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [07:46:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] nodejs16: Add /bin/nodejs symlink [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/894687 (owner: 10Clément Goubert) [07:46:29] (03PS1) 10Marostegui: db1101: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/895064 [07:46:53] (03CR) 10Marostegui: [C: 03+2] db1101: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/895064 (owner: 10Marostegui) [07:47:41] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/894638 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [07:48:46] (03CR) 10Elukey: [C: 03+2] kserve: fix missing comma in yaml config [deployment-charts] - 10https://gerrit.wikimedia.org/r/894541 (https://phabricator.wikimedia.org/T331114) (owner: 10Elukey) [07:49:06] (03PS1) 10Elukey: ml-services: update docker images for kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/895065 (https://phabricator.wikimedia.org/T329032) [07:50:19] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [07:51:55] (03PS1) 10Marostegui: mariadb: Promote db1101 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/895066 (https://phabricator.wikimedia.org/T331384) [07:52:06] (03CR) 10Marostegui: [C: 04-2] "Wait for db1101 to be ready" [puppet] - 10https://gerrit.wikimedia.org/r/895066 (https://phabricator.wikimedia.org/T331384) (owner: 10Marostegui) [07:52:25] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:52:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [07:52:49] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:03] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:05] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:39] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329203)', diff saved to https://phabricator.wikimedia.org/P45175 and previous config saved to /var/cache/conftool/dbconfig/20230307-075443-marostegui.json [07:54:45] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [07:54:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [07:54:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:54:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45176 and previous config saved to /var/cache/conftool/dbconfig/20230307-075453-marostegui.json [07:55:53] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:55:55] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:56:03] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:58:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] httpd: Let Puppet pick the init provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [07:59:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix PHP string interpolation [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [07:59:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] nagios: remove obsolete command check_all_memcached.php [puppet] - 10https://gerrit.wikimedia.org/r/885289 (owner: 10Giuseppe Lavagetto) [08:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0800). [08:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:14] o/ [08:01:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes102[3,4] to the wikikube-eqiad cluster 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/894697 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [08:02:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [08:02:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes102[3,4] to the wikikube-eqiad cluster 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/894701 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [08:04:25] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "👍" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895065 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [08:04:25] 10SRE, 10serviceops, 10Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10akosiaris) @jijiki, I +1ed the above, but we also lack a homer patch to instruct the routers to peer with the nodes. See https://gerrit.wikimedia.org/r/c/operations/homer/public/+/... 
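The "haproxy failover" alerts on the dbproxy hosts above fire when one of the two backing database servers is seen as down, which is expected while the m3 master is being moved. One way to see what HAProxy itself thinks of its backends is the runtime stats socket; the socket path and the CSV column picked out below are assumptions, so check the "stats socket" line in the host's haproxy.cfg first.

    # Dump backend/server state from the HAProxy admin socket (path is a guess).
    # Column 18 of the CSV "show stat" output is the server status (UP/DOWN).
    echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | cut -d, -f1,2,18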
[08:05:25] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update docker images for kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/895065 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [08:06:00] going to ship my cirrus patch [08:06:25] (03CR) 10Muehlenhoff: P:IDM Minor fixes and restructure. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [08:07:10] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-conf1003.eqiad.wmnet with OS bullseye [08:08:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/894677 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [08:08:03] (03PS4) 10Slyngshede: P:IDM Minor fixes and restructure. [puppet] - 10https://gerrit.wikimedia.org/r/894527 [08:09:43] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [08:10:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [08:10:47] T331384: Switchover m3 master db1159 -> db1101 - https://phabricator.wikimedia.org/T331384 [08:10:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [08:12:15] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/895065 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [08:12:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [08:14:17] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:14:32] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:15:13] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:15:36] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
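The elukey@deploy2002 lines above are the logged form of helmfile runs against the ml-staging-codfw cluster. A sketch of what such a run looks like from the deployment host follows; the chart directory layout and service name are assumptions based on the usual deployment-charts structure, not a literal transcript of these deploys.

    # On the deployment host, service releases live under helmfile.d/
    # (directory and service name here are illustrative).
    cd /srv/deployment-charts/helmfile.d/ml-services/revscoring-editquality-reverted
    # Preview the change against the staging environment, then apply it; each
    # apply shows up in SAL as a "Ran 'sync' command on namespace ..." line.
    helmfile -e ml-staging-codfw diff
    helmfile -e ml-staging-codfw apply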
[08:15:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45177 and previous config saved to /var/cache/conftool/dbconfig/20230307-081549-marostegui.json [08:15:56] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:16:15] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:16:29] (03PS2) 10Muehlenhoff: prometheus::mysqld_exporter: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/812239 [08:16:31] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:16:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [08:16:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [08:16:37] T331384: Switchover m3 master db1159 -> db1101 - https://phabricator.wikimedia.org/T331384 [08:17:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1101 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/895066 (https://phabricator.wikimedia.org/T331384) (owner: 10Marostegui) [08:18:08] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw [08:18:43] I am going to switch over m3 (phabricator) database master, phabricator will be on read only for around 1 minute [08:19:35] !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1003.eqiad.wmnet with reason: host reimage [08:20:06] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:20:18] !log Failover m3 from db1159 to db1101 - T331384 [08:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [08:21:34] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [08:21:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) d'oh, that seems likely, thank you! [yes we'll need a new storage schema in hosts.yaml, but that's when actually bringing into service,... [08:22:04] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1003.eqiad.wmnet with reason: host reimage [08:22:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As stated in the comments, we have different intervals in confd for a reason - please preserve them." [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [08:22:29] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:22:40] All done [08:22:58] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
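The flapping MegaRAID check on an-worker1078 throughout this log is about logical drives dropping from WriteBack to WriteThrough, which the controller typically does on its own while the cache battery is relearning or unhealthy. A hedged sketch of how the policy can be inspected and restored with MegaCli is below; exact binary name and flag spelling vary between MegaCli builds, so treat it as an outline of the linked wikitech runbook rather than a literal command set.

    # Inspect the current cache policy of every logical drive on adapter 0.
    sudo megacli -LDGetProp -Cache -LAll -a0

    # Check the BBU; a relearn cycle or failed battery is the usual reason the
    # controller silently falls back to WriteThrough.
    sudo megacli -AdpBbuCmd -GetBbuStatus -a0

    # Re-enable WriteBack once the battery is healthy again.
    sudo megacli -LDSetProp WB -LAll -a0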
[08:23:12] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:23:23] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:23:23] (03Merged) 10jenkins-bot: Properly pass the page id on page moves [extensions/CirrusSearch] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/894677 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [08:23:28] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [08:23:38] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:23:51] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:24:25] !log dcausse@deploy2002 Started scap: Backport for [[gerrit:894677|Properly pass the page id on page moves (T331127)]] [08:24:31] T331127: phantom redirects lingering in incategory searches after page moves - https://phabricator.wikimedia.org/T331127 [08:25:41] (03PS1) 10Nicolas Fraison: hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895127 [08:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:26:50] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [08:28:23] (03PS1) 10Marostegui: db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/895128 [08:28:36] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:894677|Properly pass the page id on page moves (T331127)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:28:58] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [08:29:10] (03CR) 10Marostegui: [C: 03+2] db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/895128 (owner: 10Marostegui) [08:29:14] (03PS1) 10Muehlenhoff: sre.maps.roll-restart: Also restart PostgreSQL [cookbooks] - 10https://gerrit.wikimedia.org/r/895129 [08:30:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45178 and previous config saved to /var/cache/conftool/dbconfig/20230307-083056-marostegui.json [08:30:57] (03PS2) 10Nicolas Fraison: hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895127 [08:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:31:29] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - 
https://phabricator.wikimedia.org/T329073 (10Marostegui) [08:32:38] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Again this is an attempt to make puppet use the discovery state in etcd. As I said multiple times, this is something we've always discoura" [puppet] - 10https://gerrit.wikimedia.org/r/893496 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [08:33:42] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize eqiad with k8s 1.23 [08:33:46] (03CR) 10Nicolas Fraison: [C: 04-2] hive: Fix max metaspace size of hiveserver2 prod to 512m [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [08:34:02] (03PS1) 10Marostegui: instances.yaml: Remove db1101 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/895130 (https://phabricator.wikimedia.org/T329352) [08:34:11] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize eqiad with k8s 1.23 [08:34:17] (03CR) 10Nicolas Fraison: [C: 04-2] "Seems that some jobs where stopped on test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [08:35:19] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1101 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/895130 (https://phabricator.wikimedia.org/T329352) (owner: 10Marostegui) [08:35:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39985/console" [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596) (owner: 10Aklapper) [08:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1101 from dbctl T329352', diff saved to https://phabricator.wikimedia.org/P45179 and previous config saved to /var/cache/conftool/dbconfig/20230307-083542-marostegui.json [08:35:53] T329352: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352 [08:36:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] Remove redirect for pk.wikimedia.org (Pakistan) [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596) (owner: 10Aklapper) [08:41:00] !log dcausse@deploy2002 Finished scap: Backport for [[gerrit:894677|Properly pass the page id on page moves (T331127)]] (duration: 16m 34s) [08:41:06] T331127: phantom redirects lingering in incategory searches after page moves - https://phabricator.wikimedia.org/T331127 [08:42:23] !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1003.eqiad.wmnet with OS bullseye [08:43:48] !log closing the UTC morning backport window [08:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:25] (03PS1) 10David Caro: cloud/cumin.aliases: update cloudvirt-codfw1 alias [puppet] - 10https://gerrit.wikimedia.org/r/895131 [08:46:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45180 and previous config saved to /var/cache/conftool/dbconfig/20230307-084602-marostegui.json [08:46:18] (03CR) 10Slyngshede: P:IDM Minor fixes and restructure. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [08:48:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [08:48:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812239 (owner: 10Muehlenhoff) [08:51:31] !log T331126 Scheduled 24H downtime for all wikikube eqiad hosts and all LVS services powered by the cluster [08:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:37] T331126: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 [08:52:37] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [08:53:57] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10elukey) [08:54:46] (03Abandoned) 10Giuseppe Lavagetto: utils: add script to sync abuse networks with conftool ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/767489 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [08:59:48] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [09:00:05] akosiaris: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Kubernetes upgrade wikikube eqiad deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0900). [09:00:39] good bot [09:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45181 and previous config saved to /var/cache/conftool/dbconfig/20230307-090109-marostegui.json [09:01:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [09:01:16] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:01:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [09:01:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45182 and previous config saved to /var/cache/conftool/dbconfig/20230307-090130-marostegui.json [09:02:42] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad [09:02:51] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=blubberoid,name=eqiad [09:03:15] dcausse: ^ wikikube eqiad upgrade started, I just depooled wdqs in eqiad [09:03:31] akosiaris: thanks, will stop the flink job [09:06:21] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd1004.eqiad.wmnet with OS bullseye [09:06:54] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd1005.eqiad.wmnet with OS bullseye [09:07:12] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd1006.eqiad.wmnet with OS bullseye [09:08:50] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:09:46] akosiaris: did you already stop k8s? [09:12:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT customresourcedefinitions) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:13:45] (JobUnavailable) firing: Reduced availability for job kubetcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:13:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [09:13:53] (03PS3) 10Alexandros Kosiaris: wikikube eqiad: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) [09:14:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Update wikikube-eqiad settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894591 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [09:14:46] !log installing PHP 7.4 security updates (as packaged in Debian Bullseye, not our internal build for Buster) [09:14:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [09:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:52] (03CR) 10Alexandros Kosiaris: wikikube eqiad: Update cluster settings for k8s 1.23 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [09:16:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikikube eqiad: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [09:16:15] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] wikikube eqiad: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/894586 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [09:16:20] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd1005.eqiad.wmnet with reason: host reimage [09:16:22] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd1004.eqiad.wmnet with reason: host reimage [09:16:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd1006.eqiad.wmnet with reason: host reimage [09:17:58] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:18:50] !log 
akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd1005.eqiad.wmnet with reason: host reimage [09:19:13] (03Merged) 10jenkins-bot: admin_ng: Update wikikube-eqiad settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/894591 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [09:21:01] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6d 18h 53m 53s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [09:21:10] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:21:19] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd1006.eqiad.wmnet with reason: host reimage [09:22:20] ouch the CirrusSearchJobQueueLagTooHigh alert is concerning, seems like changeprop consumer offsets were reset, looking [09:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45184 and previous config saved to /var/cache/conftool/dbconfig/20230307-092226-marostegui.json [09:22:33] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:23:45] (JobUnavailable) firing: Reduced availability for job k8s-api in k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:23:48] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd1004.eqiad.wmnet with reason: host reimage [09:24:24] 10SRE, 10ops-esams, 10DC-Ops: Audit future knams power usage - https://phabricator.wikimedia.org/T331358 (10Peachey88) [09:24:48] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:23] (03PS3) 10Nicolas Fraison: hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895127 (https://phabricator.wikimedia.org/T310293) [09:27:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [09:31:01] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6d 18h 53m 53s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [09:31:48] PROBLEM - Juniper virtual chassis ports on asw2-a-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [09:33:38] RECOVERY - Juniper virtual chassis ports on asw2-a-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [09:33:52] !log installing apr-util security updates on Bullseye [09:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10MatthewVernon) [09:35:18] (ProbeDown) firing: (7) Service eventgate-analytics:4592 has failed probes (http_eventgate-analytics_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:44] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd1004.eqiad.wmnet with OS bullseye [09:36:10] (ProbeDown) firing: (19) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:18] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:36:27] akosiaris: is that related? ^ [09:36:36] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [09:36:36] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:36:44] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:36:46] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:36:48] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes1022.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, [09:36:48] tes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes1012.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015. 
[09:36:48] net are marked down but pooled: eventstreams-internal_4992: Servers kubernetes1022.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kuber https://wikitech.wikimedia.org/wiki/PyBal [09:36:50] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [09:36:50] eventgate analytics is, maps is not [09:36:50] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:36:56] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [09:36:56] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:37:06] and ofc I forgot to downtime the actual LVS servers :-( [09:37:24] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:37:28] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:37:30] right [09:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45186 and previous config saved to /var/cache/conftool/dbconfig/20230307-093732-marostegui.json [09:37:52] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes1012.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, [09:37:52] tes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes1022.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006. 
[09:37:52] net are marked down but pooled: eventstreams-internal_4992: Servers kubernetes1008.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kuber https://wikitech.wikimedia.org/wiki/PyBal [09:37:54] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [09:37:55] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:38:04] fixing [09:38:23] <_joe_> marostegui: I'm sorry, we're letting a manager run production upgrades. [09:38:27] XD [09:38:35] Should I resolve the alert manually then? [09:38:40] <_joe_> it's on us, I apologize [09:38:45] (JobUnavailable) firing: (3) Reduced availability for job k8s-api in k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:38:53] <_joe_> marostegui: let's schedule a meeting where we can decide how to proceed on that [09:39:09] _joe_: do you open the google doc or should I? [09:39:28] !log schedule downtime for PyBal backends health on lvs1019, lvs1020 [09:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:07] <_joe_> marostegui: I think we can do better - I'll create a miro board [09:51:07] _joe_: and some slides to show how that'd work [09:51:07] akosiaris: I'll issue a downtime for jobunavailable alerts on k8s [09:51:07] effie: maps hosts ^ complaining [09:51:07] (ProbeDown) firing: (6) Service eventgate-analytics:4592 has failed probes (http_eventgate-analytics_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:07] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:07] on it [09:51:07] thanks [09:51:07] (JobUnavailable) firing: (7) Reduced availability for job swagger_check_citoid_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:07] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:07] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:51:07] (03PS1) 10Ottomata: Install spark3 and conda-analytics on all analytics cluster airflow nodes [puppet] - 
10https://gerrit.wikimedia.org/r/895135 (https://phabricator.wikimedia.org/T331345) [09:51:07] (JobUnavailable) firing: (7) Reduced availability for job kubetcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:07] (ProbeDown) resolved: (3) Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45187 and previous config saved to /var/cache/conftool/dbconfig/20230307-095239-marostegui.json [09:53:45] (JobUnavailable) firing: (6) Reduced availability for job swagger_check_citoid_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd1006.eqiad.wmnet with OS bullseye [09:57:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10Miriam) Hi @MatthewVernon sorry for the delay here. The account should expire on Jun 30th and I am the contact for this. Thank you again for all the work! [09:58:00] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache/IDM [puppet] - 10https://gerrit.wikimedia.org/r/895136 (https://phabricator.wikimedia.org/T135991) [09:58:45] (JobUnavailable) firing: (7) Reduced availability for job swagger_check_citoid_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:02:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10MatthewVernon) [10:02:59] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Make new impact module default on betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894594 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [10:03:13] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/895136 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:03:45] (JobUnavailable) firing: (7) Reduced availability for job swagger_check_citoid_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:59] (03Merged) 10jenkins-bot: GrowthExperiments: Make new impact module default on betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894594 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [10:04:20] (03CR) 10Ottomata: "Commented about this on task https://phabricator.wikimedia.org/T327970#8671769" [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [10:05:18] (ProbeDown) firing: (3) Service 
mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:29] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd1005.eqiad.wmnet with OS bullseye [10:05:41] akosiaris effie ^ is that still related? [10:05:58] marostegui: mathoid? yes [10:06:10] ok, acked it [10:06:29] ok, what did I miss and failed to properly schedule downtime for these ? [10:07:15] last one was for recommendation-api [10:07:18] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster1001.eqiad.wmnet with OS bullseye [10:07:42] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster1002.eqiad.wmnet with OS bullseye [10:07:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329203)', diff saved to https://phabricator.wikimedia.org/P45188 and previous config saved to /var/cache/conftool/dbconfig/20230307-100745-marostegui.json [10:07:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [10:07:52] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:08:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [10:08:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T329203)', diff saved to https://phabricator.wikimedia.org/P45189 and previous config saved to /var/cache/conftool/dbconfig/20230307-100807-marostegui.json [10:08:45] (JobUnavailable) firing: (7) Reduced availability for job swagger_check_citoid_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:13] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Apache/IDM [puppet] - 10https://gerrit.wikimedia.org/r/895136 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:10:18] (ProbeDown) resolved: (2) Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:04] (03PS1) 10MVernon: Add user nickifeajika (analytics-privatedata-users, krb) [puppet] - 10https://gerrit.wikimedia.org/r/895137 (https://phabricator.wikimedia.org/T331277) [10:15:34] (03PS5) 10Slyngshede: P:IDM Minor fixes and restructure. 
[puppet] - 10https://gerrit.wikimedia.org/r/894527 [10:16:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster1001.eqiad.wmnet with reason: host reimage [10:16:37] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: host reimage [10:16:56] (03PS1) 10JMeybohm: kubernetes__deployment_server: Switch to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) [10:17:23] (03PS2) 10JMeybohm: kubernetes_deployment_server: Switch to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) [10:18:07] (03CR) 10Slyngshede: [C: 03+2] P:IDM Minor fixes and restructure. [puppet] - 10https://gerrit.wikimedia.org/r/894527 (owner: 10Slyngshede) [10:19:05] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster1001.eqiad.wmnet with reason: host reimage [10:19:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39986/console" [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:21:38] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: host reimage [10:22:50] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:24:47] looks like the maps outages were it being unable to connect to tegola in eqiad, although the logging was not very helpful in that regard [10:25:14] (03PS1) 10Slyngshede: P:idm deployment no longer needs production variable. [puppet] - 10https://gerrit.wikimedia.org/r/895139 [10:27:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39987/console" [puppet] - 10https://gerrit.wikimedia.org/r/895139 (owner: 10Slyngshede) [10:27:43] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm deployment no longer needs production variable. [puppet] - 10https://gerrit.wikimedia.org/r/895139 (owner: 10Slyngshede) [10:28:37] !log apt1001: pull latest packages for thirdparty/kubeadm-k8s-1-22 buster-wikimedia (T286856) [10:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:44] T286856: Upgrade Toolforge Kubernetes to latest 1.22 - https://phabricator.wikimedia.org/T286856 [10:29:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329203)', diff saved to https://phabricator.wikimedia.org/P45190 and previous config saved to /var/cache/conftool/dbconfig/20230307-102901-marostegui.json [10:29:08] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:29:34] (03CR) 10Slyngshede: [C: 03+2] LOGIN: Add custom WikiMedia SSO login page. [software/bitu] - 10https://gerrit.wikimedia.org/r/891797 (owner: 10Slyngshede) [10:29:37] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] LOGIN: Add custom WikiMedia SSO login page. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/891797 (owner: 10Slyngshede) [10:33:52] (03PS1) 10MVernon: hiera: use a regex to specify new-style storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/895141 (https://phabricator.wikimedia.org/T308677) [10:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET leases) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [10:37:17] (03PS3) 10JMeybohm: profile::kubernetes::client: Switch to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) [10:38:56] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster1002.eqiad.wmnet with OS bullseye [10:38:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895129 (owner: 10Muehlenhoff) [10:38:58] (KubernetesCalicoDown) firing: (5) kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:39:15] (03CR) 10Jbond: [C: 03+1] cloud/cumin.aliases: update cloudvirt-codfw1 alias [puppet] - 10https://gerrit.wikimedia.org/r/895131 (owner: 10David Caro) [10:39:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster1001.eqiad.wmnet with OS bullseye [10:39:54] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET leases) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:40:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/895137 (https://phabricator.wikimedia.org/T331277) (owner: 10MVernon) [10:40:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/895137 (https://phabricator.wikimedia.org/T331277) (owner: 10MVernon) [10:41:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39988/console" [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:43:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET leases) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:45] 
(JobUnavailable) firing: Reduced availability for job k8s-pods in k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:44:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45191 and previous config saved to /var/cache/conftool/dbconfig/20230307-104408-marostegui.json [10:48:15] (03PS4) 10JMeybohm: profile::kubernetes::client: Switch to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) [10:48:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET leases) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:31] !log apt2001: pull latest packages for thirdparty/kubeadm-k8s-1-22 buster-wikimedia (T286856) [10:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:37] T286856: Upgrade Toolforge Kubernetes to latest 1.22 - https://phabricator.wikimedia.org/T286856 [10:48:44] (03CR) 10David Caro: [C: 03+2] cloud/cumin.aliases: update cloudvirt-codfw1 alias [puppet] - 10https://gerrit.wikimedia.org/r/895131 (owner: 10David Caro) [10:50:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39989/console" [puppet] - 10https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:51:07] !log manually label kubemaster1001, kubemaster1002 giving them role master T307943 [10:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:13] T307943: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 [10:52:01] (03PS1) 10Muehlenhoff: logstash: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) [10:52:15] (03PS10) 10EoghanGaffney: Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 [10:53:53] (03PS4) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [10:53:54] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1007.eqiad.wmnet with OS bullseye [10:54:23] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1005.eqiad.wmnet with OS bullseye [10:54:48] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1006.eqiad.wmnet with OS bullseye [10:55:39] (03PS1) 10Muehlenhoff: arclamp: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895145 (https://phabricator.wikimedia.org/T135991) [10:55:42] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1015.eqiad.wmnet with OS bullseye [10:56:03] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1016.eqiad.wmnet with OS bullseye [10:57:17] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1008.eqiad.wmnet with OS bullseye [10:57:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39990/console" [puppet] - 
10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [10:58:20] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:58:25] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [10:58:53] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1009.eqiad.wmnet with OS bullseye [10:58:57] (03PS1) 10Muehlenhoff: IDM: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895166 (https://phabricator.wikimedia.org/T135991) [10:59:00] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1010.eqiad.wmnet with OS bullseye [10:59:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1011.eqiad.wmnet with OS bullseye [10:59:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45192 and previous config saved to /var/cache/conftool/dbconfig/20230307-105914-marostegui.json [10:59:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1012.eqiad.wmnet with OS bullseye [11:00:13] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1013.eqiad.wmnet with OS bullseye [11:00:18] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1014.eqiad.wmnet with OS bullseye [11:01:00] (03PS4) 10Hnowlan: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [11:01:12] (03PS5) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [11:01:33] (03CR) 10MVernon: [C: 03+2] Add user nickifeajika (analytics-privatedata-users, krb) [puppet] - 10https://gerrit.wikimedia.org/r/895137 (https://phabricator.wikimedia.org/T331277) (owner: 10MVernon) [11:02:52] (03CR) 10Hnowlan: [C: 03+1] "lgtm - tileratorui will be removed soon but I will clean up this reference when that happens." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/895129 (owner: 10Muehlenhoff) [11:03:10] (03CR) 10Jbond: profile::confd: add a confd profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [11:04:00] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39992/console" [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [11:04:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39991/console" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [11:05:11] (03CR) 10CI reject: [V: 04-1] helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:05:23] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubernetes1016.eqiad.wmnet with OS bullseye [11:05:59] fail, oh the joy [11:06:29] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1005.eqiad.wmnet with reason: host reimage [11:06:33] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1006.eqiad.wmnet with reason: host reimage [11:06:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @nickifeajika all done now. [11:07:18] (03CR) 10Elukey: [C: 03+2] profile::service_proxy::envoy: add support for inference [puppet] - 10https://gerrit.wikimedia.org/r/894014 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [11:08:00] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage [11:08:41] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895141 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:08:45] (JobUnavailable) firing: (3) Reduced availability for job dragonfly_dfdaemon in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:36] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1005.eqiad.wmnet with reason: host reimage [11:11:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage [11:11:36] (03PS1) 10Filippo Giunchedi: search-platform: deploy rdf streaming updater alerts globally [alerts] - 10https://gerrit.wikimedia.org/r/895168 (https://phabricator.wikimedia.org/T309182) [11:12:08] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage [11:12:31] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) >>! In T325128#8667625, @MatthewVernon wrote: > @hashar are there still things that... 
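For context on the T325128 update just above: git's "detected dubious ownership" error comes from the safe.directory check introduced in git 2.35.2, and the usual workaround is to whitelist the shared checkout for whichever user runs git there. A minimal sketch, with the path taken from the task title; the per-user --global scope shown here is illustrative, a system-wide entry may be preferred instead:

    # Mark the shared staging checkout as safe for the current user
    git config --global --add safe.directory /srv/mediawiki-staging

    # Confirm the warning is gone
    git -C /srv/mediawiki-staging status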
[11:12:53] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage [11:12:55] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage [11:12:59] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage [11:13:11] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage [11:13:30] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [11:13:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:13:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895145 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:14:09] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage [11:14:13] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: host reimage [11:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329203)', diff saved to https://phabricator.wikimedia.org/P45193 and previous config saved to /var/cache/conftool/dbconfig/20230307-111421-marostegui.json [11:14:26] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1006.eqiad.wmnet with reason: host reimage [11:14:28] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:15:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:59] (03PS1) 10JMeybohm: cert-manager: Set priorityClassName by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/895169 (https://phabricator.wikimedia.org/T310618) [11:16:01] (03PS1) 10JMeybohm: admin_ng: Remove warning comment about allowCriticalPods [deployment-charts] - 10https://gerrit.wikimedia.org/r/895170 (https://phabricator.wikimedia.org/T310618) [11:16:03] (03PS1) 10JMeybohm: custom_deploy: Set priorityClass for istio in ml and dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/895171 (https://phabricator.wikimedia.org/T310618) [11:16:12] ftr, wikidata query service lag has been steadily rising since ca. 
9:05 UTC on all eqiad servers (wdqs*) – I don’t know if that’s related to any other ongoing issues [11:16:16] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-12h&to=now&viewPanel=8 [11:16:18] (03CR) 10Muehlenhoff: [C: 03+2] sre.maps.roll-restart: Also restart PostgreSQL [cookbooks] - 10https://gerrit.wikimedia.org/r/895129 (owner: 10Muehlenhoff) [11:16:26] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:16:42] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/895141 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:16:45] s/wdqs*/wdqs1*/, sorry – the wdqs2* ones seem to be fine [11:17:11] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage [11:17:30] Lucas_WMDE: the flink updater isn't running currently for eqiad and eqiad wdqs is depooled [11:17:40] we are upgrading the wikikube eqiad cluster [11:17:43] ok [11:17:49] https://phabricator.wikimedia.org/T331126 [11:18:19] got it, thanks [11:18:22] good luck with the upgrade then [11:18:27] it's expected that once we deploy flink updater in eqiad again, we will cover the lost ground, catch up and then repool wdqs in eqiad [11:18:37] (03PS5) 10Hnowlan: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [11:19:11] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage [11:19:11] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage [11:19:20] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage [11:20:13] Lucas_WMDE: eqiad servers are not pooled so I wonder why max lag replication is counting eqiad [11:20:15] but I wonder if there’s some place where it’s still pooled – I can still see wdqs1* in the API maxlag https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1 [11:20:20] jinx ^^ [11:20:21] (https://grafana-rw.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m) [11:20:23] (03CR) 10Slyngshede: [C: 03+2] IDM: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895166 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:20:30] :) [11:20:36] I thought https://phabricator.wikimedia.org/T238751 was done, so the maxlag isn’t *supposed* to take depooled servers into account anymore [11:20:41] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/895166 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:20:43] but I know basically nothing about how that was implemented I’m afraid [11:20:46] it worked last time... [11:20:50] (03PS1) 10Vgutierrez: hiera: Enable ESI testing in cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/895172 (https://phabricator.wikimedia.org/T308799) [11:20:55] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10S_Mukuti) [11:20:59] when codfw was upgraded... 
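A quick way to reproduce what Lucas_WMDE describes above: hitting the API with maxlag=-1 (as in the URL quoted at 11:20:15) makes every request fail with a maxlag error whose body shows the lag value the API is currently acting on. A minimal sketch, assuming jq is available on the host running it:

    # maxlag=-1 is always exceeded, so instead of answering the query the API
    # returns its current view of the lag (and which backend is driving it).
    curl -s 'https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1' \
      | jq '.error'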
[11:21:19] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage [11:21:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10S_Mukuti) @KSiebert [11:21:34] 10SRE, 10serviceops, 10Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10JMeybohm) [11:21:36] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage [11:21:51] dcausse I wonder if last time it worked because the maxlag is based on median + 1? [11:22:07] it looks like there are twice as many eqiad servers than codfw ones, so the median might still have been fine then [11:22:19] isn't it using the most lagged pooled server? [11:22:51] oh, https://phabricator.wikimedia.org/T322030 isn’t done yet [11:22:56] so it sounds like it is most lagged at the moment [11:23:02] you just told us it should be median [11:23:05] and that’s what I remembered [11:23:10] PROBLEM - Check for large files in client bucket on kubernetes1013 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.229. Check system logs on 10.64.48.229 https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:23:13] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubernetes1006.eqiad.wmnet with OS bullseye [11:23:30] PROBLEM - dhclient process on kubernetes1011 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.134: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [11:23:30] PROBLEM - Check size of conntrack table on kubernetes1013 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.229: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:23:36] PROBLEM - confd service on kubernetes1011 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.134: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:45] (JobUnavailable) firing: (3) Reduced availability for job dragonfly_dfdaemon in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:23:55] (03PS1) 10Filippo Giunchedi: o11y: scope alerts deployment for main sites [alerts] - 10https://gerrit.wikimedia.org/r/895173 (https://phabricator.wikimedia.org/T309182) [11:24:20] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: host reimage [11:24:26] (03CR) 10Elukey: [C: 03+1] "Thanks :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895171 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:24:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:24:46] (03CR) 10Elukey: [C: 03+1] cert-manager: Set priorityClassName by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/895169 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:24:58] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - 
https://phabricator.wikimedia.org/T329073 (10aborrero) [11:24:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:25:06] PROBLEM - puppet last run on kubernetes1011 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.134. Check system logs on 10.64.32.134 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:25:06] PROBLEM - Check systemd state on kubernetes1013 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.229: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:10] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895145 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:25:10] (03CR) 10Elukey: [C: 03+1] admin_ng: Remove warning comment about allowCriticalPods [deployment-charts] - 10https://gerrit.wikimedia.org/r/895170 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:25:12] (03CR) 10Filippo Giunchedi: [C: 03+2] search-platform: deploy rdf streaming updater alerts globally [alerts] - 10https://gerrit.wikimedia.org/r/895168 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:25:38] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10aborrero) [11:25:54] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: scope alerts deployment for main sites [alerts] - 10https://gerrit.wikimedia.org/r/895173 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:26:03] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10aborrero) Sent a ping to @Marostegui regarding clouddb[1013-1014,1021] Also @Andrew regarding cloudservices host, but I think the host can be taken down... 
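The many "Cookbook sre.hosts.downtime ... with reason: host reimage" SAL entries in this log come from the downtime cookbook run on a cumin host. A rough sketch of what such an invocation looks like is below; the option names (--hours, -r) are assumptions from memory of the cookbook's help output, not verified against its current interface, and the host name is only illustrative:

    # Downtime a host in monitoring for two hours before reimaging it
    sudo cookbook sre.hosts.downtime --hours 2 -r "host reimage" 'kubernetes1016.eqiad.wmnet'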
[11:26:10] PROBLEM - puppet last run on kubernetes1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.218: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:26:54] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1013 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.229: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [11:27:14] RECOVERY - Check size of conntrack table on kubernetes1013 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:27:22] RECOVERY - confd service on kubernetes1011 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:27:26] PROBLEM - Check for large files in client bucket on kubernetes1008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.218: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:27:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [11:27:44] RECOVERY - Check systemd state on kubernetes1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:06] RECOVERY - Check for large files in client bucket on kubernetes1013 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:28:08] Lucas_WMDE: but last time there was an issue with systemd timers so perhaps this caused this script to stop functioning? [11:28:21] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1013.eqiad.wmnet with OS bullseye [11:28:39] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes1005.eqiad.wmnet with OS bullseye [11:28:41] and this would have caused the same problem maybe without this timer issue? 
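For reference, the recurring "dbctl commit (dc=all): 'Depooling/Repooling ...'" entries around db2168:3318 and db2181 correspond roughly to the cycle sketched below. Command names are from memory of the dbctl documentation; the exact flags and the gradual repooling percentage are assumptions, and note that dbctl does not manage the clouddb* wiki-replica hosts discussed just above:

    # Take a replica out of rotation before maintenance (T329203-style schema change)
    dbctl instance db2181 depool
    dbctl config commit -m 'Depooling db2181 (T329203)'

    # ...run the maintenance, then repool gradually...
    dbctl instance db2181 pool -p 25        # percentage is illustrative
    dbctl config commit -m 'Repooling after maintenance db2181 (T329203)'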
[11:28:45] (JobUnavailable) resolved: (3) Reduced availability for job dragonfly_dfdaemon in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:28:46] PROBLEM - Disk space on kubernetes1009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.16.188: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubernetes1009&var-datasource=eqiad+prometheus/ops [11:29:25] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1007.eqiad.wmnet with OS bullseye [11:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:30:18] RECOVERY - Check for large files in client bucket on kubernetes1008 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:30:33] (03CR) 10JMeybohm: [C: 03+2] custom_deploy: Set priorityClass for istio in ml and dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/895171 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:30:35] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Remove warning comment about allowCriticalPods [deployment-charts] - 10https://gerrit.wikimedia.org/r/895170 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:30:37] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Set priorityClassName by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/895169 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:30:44] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable ESI testing in cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/895172 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [11:31:42] RECOVERY - puppet last run on kubernetes1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:32:20] PROBLEM - Host kubernetes1011 is DOWN: PING CRITICAL - Packet loss = 100% [11:32:45] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) @aborrero regarding clouddb* hosts, it is up to your team but I think it would be nice if you could depool them. Better user experience for s... 
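If the stalled-timer theory above needed checking, the usual first look on the maintenance host would be along these lines; the systemd unit name below is a guess for illustration only, not the real timer name:

    # See when candidate timers last fired and when they fire next
    systemctl list-timers --all | grep -i queryservice

    # Inspect recent runs of the suspected unit (unit name is hypothetical)
    journalctl -u mediawiki_job_updateQueryServiceLag.service -n 50 --no-pager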
[11:33:22] RECOVERY - Host kubernetes1011 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:33:27] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1012.eqiad.wmnet with OS bullseye [11:34:26] PROBLEM - Host kubernetes1008 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:04] RECOVERY - Host kubernetes1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:35:04] dcausse: `curl -H 'Accept: application/json' 'http://lvs2009:9090/pools/wdqs_80' | jq .` looks correct to me, at least, and I think that’s what the code is looking at [11:35:13] where only wdqs2* are pooled+enabled+up [11:35:18] but I notice wdqs1* aren’t in the output at all [11:35:19] (03Merged) 10jenkins-bot: cert-manager: Set priorityClassName by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/895169 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:35:25] rather than being reported as pooled: false or anything [11:35:32] so I wonder if it’s using cached information for those… [11:35:40] (03CR) 10Btullis: [C: 03+1] "Thank you." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895171 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:35:54] (that `curl` was on mwdebu2002 btw) [11:36:10] RECOVERY - dhclient process on kubernetes1011 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [11:36:10] RECOVERY - puppet last run on kubernetes1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:36:17] that would explain but I'm not very familiar with this script :/ [11:36:28] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1011.eqiad.wmnet with OS bullseye [11:37:02] PROBLEM - Host kubernetes1009 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:31] (03CR) 10Jbond: [C: 04-1] "-1: looks goof but use a function instead of a global to get the users (see inline)" [puppet] - 10https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [11:37:43] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubernetes1015.eqiad.wmnet with OS bullseye [11:37:45] (03Merged) 10jenkins-bot: admin_ng: Remove warning comment about allowCriticalPods [deployment-charts] - 10https://gerrit.wikimedia.org/r/895170 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:37:47] (03Merged) 10jenkins-bot: custom_deploy: Set priorityClass for istio in ml and dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/895171 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:38:00] RECOVERY - Disk space on kubernetes1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubernetes1009&var-datasource=eqiad+prometheus/ops [11:38:00] if lvs2009 had wdqs1* hosts defined, we would be having major configuration issues [11:38:02] RECOVERY - Host kubernetes1009 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [11:38:13] hm, the code looks correct to me so far [11:38:15] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1009.eqiad.wmnet with OS bullseye [11:38:22] it !array_key_exists then return false (not pooled) [11:38:26] I’ll keep looking… [11:38:29] !log akosiaris@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host kubernetes1010.eqiad.wmnet with OS bullseye [11:38:55] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1008.eqiad.wmnet with OS bullseye [11:40:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/895141 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:40:28] curl -H 'Accept: application/json' 'http://lvs1019:9090/pools/wdqs_80' | jq . [11:40:30] on mwdebug1002 [11:40:37] says that all the wdqs1013 servers are pooled+enabled+up [11:41:05] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Split out a new class for phabricator::config [puppet] - 10https://gerrit.wikimedia.org/r/891841 (owner: 10EoghanGaffney) [11:41:08] and updateQueryServiceLag.php seems to run with `--lb lvs1019:9090 --lb lvs2009:9090` so I think it’ll query both of them [11:41:26] sorry, not true, it doesn’t say *all* the wdqs1 servers are pooled+enabled+up [11:41:27] but some are [11:41:31] (03CR) 10MVernon: [C: 03+2] hiera: use a regex to specify new-style storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/895141 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:41:52] wdqs1013 is the "internal" cluster [11:41:55] (03PS1) 10Effie Mouzeli: Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) [11:42:04] should have been depooled too [11:42:05] wdqs1004 and 1006 are pooled false, enabled false [11:42:05] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1014.eqiad.wmnet with OS bullseye [11:42:12] but the others are pooled true enabled true [11:43:26] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1016.eqiad.wmnet with OS bullseye [11:43:47] wait no wdqs1013 is the public cluster [11:44:32] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:44:54] it has been depooled via dns discovery so perhaps these are too unrelated things? [11:45:01] s/too/two [11:45:06] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:45:47] (03CR) 10Muehlenhoff: [C: 03+2] arclamp: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895145 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:46:21] (03CR) 10Muehlenhoff: [C: 03+2] IDM: Enable profile::auto_restarts::service for apache2-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/895166 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:47:00] okay, mwdebug2002 can’t resolve lvs1019 but mwmaint2002 can [11:47:00] I can depool them individually but it seems that the script should look at the dns discovery status too (not sure how to do that tho)? [11:47:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2070.codfw.wmnet with OS bullseye [11:47:11] and the script runs on mwmaint2002, so I assume that’s the reason [11:47:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, and 2 others: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye [11:47:25] Lucas_WMDE: try with a FQDN? 
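Following dcausse's FQDN suggestion, the same check with a fully-qualified name plus a filter that keeps only servers that are pooled, enabled and up might look like this. The .eqiad.wmnet suffix, the assumption that the endpoint returns a host-to-state object, and the field names (pooled/enabled/up, taken from the discussion above) should all be treated as a sketch:

    # Query PyBal's pool state via the FQDN and list only healthy, pooled backends
    curl -s -H 'Accept: application/json' 'http://lvs1019.eqiad.wmnet:9090/pools/wdqs_80' \
      | jq -r 'to_entries[] | select(.value.pooled and .value.enabled and .value.up) | .key'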
[11:48:12] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:48:44] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:48:45] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:49:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one remaining typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [11:51:08] (03CR) 10Muehlenhoff: [C: 03+2] libraryupgrader: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/894063 (owner: 10Muehlenhoff) [11:51:25] (03PS1) 10JMeybohm: cfssl-issuer: Set priorityClassName system-cluster-critical [deployment-charts] - 10https://gerrit.wikimedia.org/r/895179 (https://phabricator.wikimedia.org/T310618) [11:52:14] RECOVERY - Host ms-be2070 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [11:52:35] depooled all servers manually [11:52:49] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, 10wdwb-tech: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10Lucas_Werkmeister_WMDE) [11:52:50] dcausse: I filed T331405, probably needs some rephrasing from people who know how all this works ^^ [11:52:51] T331405: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 [11:52:54] and thanks! [11:53:13] API maxlag looks good now [11:53:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:53:39] (03CR) 10JMeybohm: [C: 03+1] Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:53:54] Lucas_WMDE: thanks for investigation and the task! 
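On the an-worker1078 MegaRAID alert that keeps flapping between WriteThrough and WriteBack: this pattern is typically the controller temporarily dropping its write cache to WriteThrough while the BBU relearns or is degraded. A way to inspect the current state is sketched below; the binary name varies by install (MegaCli, MegaCli64, megacli), so treat it as an assumption:

    # Show the current cache policy of every logical drive on every adapter
    sudo megacli -LDGetProp -Cache -LAll -aALL

    # BBU state often explains a WriteThrough fallback
    sudo megacli -AdpBbuCmd -GetBbuStatus -aALL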
[11:54:24] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, 10wdwb-tech: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10Lucas_Werkmeister_WMDE) [11:54:55] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes1015.eqiad.wmnet with OS bullseye [11:56:00] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:56:25] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, 10wdwb-tech: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10Lucas_Werkmeister_WMDE) [11:56:37] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:56:40] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1016.eqiad.wmnet with reason: host reimage [11:56:42] (03Merged) 10jenkins-bot: Add kubernetes102[3,4] to its k8s_neighbors list 0/3 [homer/public] - 10https://gerrit.wikimedia.org/r/895175 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:56:56] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.32:443, 10.2.2.32:8888, 10.2.2.32:80]) https://wikitech.wikimedia.org/wiki/PyBal [11:57:44] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1013 is OK: OK: synced at Tue 2023-03-07 11:57:42 UTC. https://wikitech.wikimedia.org/wiki/NTP [11:58:40] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, and 2 others: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10Lucas_Werkmeister_WMDE) Tagging this as an incident follow-up – while the maxlag was high, edits slowed down d... [11:59:12] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Set priorityClassName system-cluster-critical [deployment-charts] - 10https://gerrit.wikimedia.org/r/895179 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [11:59:52] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1016.eqiad.wmnet with reason: host reimage [11:59:56] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.32:443, 10.2.2.32:8888, 10.2.2.32:80]) https://wikitech.wikimedia.org/wiki/PyBal [12:00:23] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:01:00] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:01:06] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:01:46] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[12:02:03] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10Bugreporter) See also: {T226453} {T153563} [12:03:12] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1017.eqiad.wmnet with OS bullseye [12:03:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [12:04:55] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:05:01] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:05:39] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:05:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:06:00] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:06:02] (03Merged) 10jenkins-bot: cfssl-issuer: Set priorityClassName system-cluster-critical [deployment-charts] - 10https://gerrit.wikimedia.org/r/895179 (https://phabricator.wikimedia.org/T310618) (owner: 10JMeybohm) [12:06:08] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:06:10] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:06:19] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:06:36] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1015.eqiad.wmnet with reason: host reimage [12:06:44] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:06:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [12:07:43] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:08:24] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:08:26] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:08:54] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:08:56] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:09:32] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:09:36] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:09:37] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:09:42] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1015.eqiad.wmnet with reason: host reimage [12:10:18] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:10:47] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:10:49] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:11:01] (03CR) 10Isabelle Hurbain-Palatin: [C: 03+1] Enable new Linter UI for namespace, tag and template for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [12:11:49] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:12:03] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:12:13] !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:12:27] !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
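The run of helmfile.d/admin 'apply' entries above corresponds to applying the cluster admin charts environment by environment. A rough sketch of that loop with the stock helmfile CLI, assuming a checkout path and environment names matching the log (the wrapper actually used on the deploy hosts may differ):

    # apply the admin helmfile against each cluster environment in turn
    cd /srv/deployment-charts/helmfile.d/admin   # path is an assumption
    for env in staging-codfw staging-eqiad codfw ml-staging-codfw ml-serve-eqiad ml-serve-codfw dse-k8s-eqiad aux-k8s-eqiad; do
      helmfile -e "$env" apply
    done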
[12:12:28] !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:12:43] !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:13:18] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:13:31] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:13:45] (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:13:57] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:14:11] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes1016.eqiad.wmnet with OS bullseye [12:14:18] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:14:19] !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:14:39] !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:14:41] !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:14:48] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:15:02] !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:15:02] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:15:03] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:15:25] !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:15:26] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:15:42] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:16:39] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage [12:17:40] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:17:52] (03PS1) 10Elukey: custom_deploy.d: upgrade istio ml-staging's config to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/895181 (https://phabricator.wikimedia.org/T324542) [12:18:05] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:10] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:15] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:16] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:18] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:19] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:21] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:18:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:18:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:19:18] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:19:20] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:19:34] (03PS1) 10Slyngshede: Read systems and approval rules from YAML file. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [12:19:52] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage [12:25:02] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes1015.eqiad.wmnet with OS bullseye [12:25:42] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: upgrade istio ml-staging's config to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/895181 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [12:27:27] (03CR) 10Muehlenhoff: "Looking at the data I think there's still some inconsistency with the kernel detection. I think it only applies to the "list" action, the " [puppet] - 10https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [12:27:46] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1018.eqiad.wmnet with OS bullseye [12:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST gateways) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:48] (03CR) 10Joal: [C: 03+1] "Awesome - I don't know if our issue is related to this problem particularly, but this change can only help :)" [puppet] - 10https://gerrit.wikimedia.org/r/895127 (https://phabricator.wikimedia.org/T310293) (owner: 10Nicolas Fraison) [12:31:49] (03PS3) 10Slyngshede: data.yaml add sgimeno to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) [12:33:21] (03PS6) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [12:34:28] (03CR) 10Slyngshede: SUL account linking (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [12:35:17] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [12:37:36] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1017.eqiad.wmnet with OS bullseye [12:37:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST gateways) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:38:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1019.eqiad.wmnet with OS bullseye [12:41:27] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: host reimage [12:44:06] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: host reimage [12:47:33] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [12:49:10] (03CR) 10Btullis: [V: 03+1 C: 03+2] Disable all gobblin jobs to allow for HDFS maintenance [puppet] - 10https://gerrit.wikimedia.org/r/894537 (https://phabricator.wikimedia.org/T329073) (owner: 10Btullis) [12:49:31] (03PS1) 10Volans: 
sre.loadbalancer.restart-pybal: simplify call [cookbooks] - 10https://gerrit.wikimedia.org/r/895184 [12:49:33] (03PS1) 10Volans: sre.network: fix minor bugs and type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895185 [12:49:35] (03PS1) 10Volans: sre.hadoop: do not override API method [cookbooks] - 10https://gerrit.wikimedia.org/r/895206 [12:49:37] (03PS1) 10Volans: sre.{ganeti,hardware,hosts}: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895207 [12:49:39] (03PS1) 10Volans: sre.k8s: fix issues reported by mypy [cookbooks] - 10https://gerrit.wikimedia.org/r/895208 [12:49:41] (03PS1) 10Volans: sre.mysql.upgrade: remove wrong line [cookbooks] - 10https://gerrit.wikimedia.org/r/895209 [12:49:43] (03PS1) 10Volans: sre.mediawiki.route-traffic: fix wrong call [cookbooks] - 10https://gerrit.wikimedia.org/r/895210 [12:49:45] (03PS1) 10Volans: sre.{idm,pdus,puppet}: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895211 [12:49:47] (03PS1) 10Volans: sre.loadbalancer.restart-pybal: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895212 [12:49:49] (03PS1) 10Volans: sre.discovery: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895213 [12:49:51] (03PS1) 10Volans: sre.wdqs.data-transfer: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895214 [12:49:53] (03PS1) 10Volans: sre.k8s.pool-depool-cluster: ignore mypy errors [cookbooks] - 10https://gerrit.wikimedia.org/r/895215 [12:49:55] (03PS1) 10Volans: tox: add mypy testing [cookbooks] - 10https://gerrit.wikimedia.org/r/895216 [12:50:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:51:41] (03PS10) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:52:21] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: host reimage [12:52:54] (03PS1) 10FNegri: clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) [12:53:19] (03CR) 10CI reject: [V: 04-1] clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [12:54:53] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - upgrade to upstream 1.4.0 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/894643 (https://phabricator.wikimedia.org/T331282) (owner: 10Ottomata) [12:55:13] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10fnegri) @Marostegui @aborrero the patch above should depool clouddb1013 and clouddb1014. I don't think clouddb1021 can be depooled easily as it looks li... [12:55:25] (03CR) 10Muehlenhoff: "The script looks good, just left a few typos inline. 
However, testing the implemented logout workflow made me wonder we actually need this" [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [12:55:33] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: host reimage [12:55:33] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily removed dns1001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/894654 (https://phabricator.wikimedia.org/T329073) (owner: 10Ssingh) [12:55:44] (03PS11) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:55:50] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [12:55:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:59] (03PS2) 10FNegri: clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) [12:57:05] !log removing dns1001 from authdns_servers for T329073 [12:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:11] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [12:57:39] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [12:57:51] (03CR) 10CI reject: [V: 04-1] sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [12:58:36] PROBLEM - Bird Internet Routing Daemon on dns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:59:12] ^ expected [12:59:54] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:28] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1018.eqiad.wmnet with OS bullseye [13:00:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:04:48] !log drain ganeti1011 for eventual reimage to Bullseye T311687 [13:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:55] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [13:04:58] (03CR) 10Slyngshede: [V: 03+2] SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [13:05:04] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: 10Slyngshede) [13:05:23] (03PS2) 10Slyngshede: LOGIN: Add custom WikiMedia SSO login page. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/891797 [13:05:26] (03CR) 10Slyngshede: [V: 03+2] LOGIN: Add custom WikiMedia SSO login page. [software/bitu] - 10https://gerrit.wikimedia.org/r/891797 (owner: 10Slyngshede) [13:05:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:24] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, and 2 others: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10Joe) To ensure I understood your problem correctly: why were those servers not getting updated anymore? Update... [13:07:18] (03PS1) 10Ottomata: admin_ng - upgrade flink-kubernetes-operator to 1.4.0 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/895218 (https://phabricator.wikimedia.org/T331282) [13:08:22] (03CR) 10Btullis: [C: 03+1] "Looks good. Feel free to merge and deploy at any time." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895218 (https://phabricator.wikimedia.org/T331282) (owner: 10Ottomata) [13:09:50] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:43] (03CR) 10Ottomata: [C: 03+2] admin_ng - upgrade flink-kubernetes-operator to 1.4.0 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/895218 (https://phabricator.wikimedia.org/T331282) (owner: 10Ottomata) [13:11:05] (03CR) 10Elukey: [C: 03+1] admin_ng - upgrade flink-kubernetes-operator to 1.4.0 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/895218 (https://phabricator.wikimedia.org/T331282) (owner: 10Ottomata) [13:11:30] (03PS1) 10David Caro: wmcs: update ceph alerts dashboard [alerts] - 10https://gerrit.wikimedia.org/r/895220 [13:12:54] (03CR) 10CI reject: [V: 04-1] wmcs: update ceph alerts dashboard [alerts] - 10https://gerrit.wikimedia.org/r/895220 (owner: 10David Caro) [13:13:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:14:09] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1019.eqiad.wmnet with OS bullseye [13:14:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:15:05] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Wikidata.org, and 2 others: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (10dcausse) >>! In T331405#8672341, @Joe wrote: > Updates shouldn't depend on where the discovery dns record poin... [13:15:20] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:15:58] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:16:20] (03PS2) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [13:16:42] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:14] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jnuche) @akosiaris thanks for the feedback. Just to clarify, we can work around the issue currently, but it makes the frequent Scap self-update process more error-... [13:23:57] (03PS1) 10Slyngshede: Remove duplicate installed apps from base settings. [software/bitu] - 10https://gerrit.wikimedia.org/r/895221 [13:26:56] (03CR) 10Muehlenhoff: "Looking at the deploy repo on the thumbor hosts the last deployment to 3d2png was done in 2019 by Marko who ported it to Stretch. So I thi" [puppet] - 10https://gerrit.wikimedia.org/r/894542 (owner: 10Jaime Nuche) [13:27:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:29:13] (03PS1) 10Filippo Giunchedi: o11y: exclude Exemplars from Thanos Query errors [alerts] - 10https://gerrit.wikimedia.org/r/895222 [13:31:34] 10SRE, 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [13:32:36] (03PS1) 10Arturo Borrero Gonzalez: toolforge: wmcs-k8s-get-cert.sh: fix inverted logic [puppet] - 10https://gerrit.wikimedia.org/r/895224 [13:35:27] (03CR) 10David Caro: clouddb: depool clouddb[1013-1014] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [13:36:46] (03PS3) 10FNegri: clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) [13:37:08] (03CR) 10CI reject: [V: 04-1] clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [13:37:31] (03CR) 10FNegri: clouddb: depool clouddb[1013-1014] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [13:37:57] (03PS4) 10FNegri: clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) [13:40:33] (03PS1) 10Jbond: pki: failover to codfw for switch reboot [dns] - 10https://gerrit.wikimedia.org/r/895225 (https://phabricator.wikimedia.org/T329073) [13:44:32] (03PS5) 10Jbond: mod_auth_cas: add logout script for mod_auth_cas [puppet] - 10https://gerrit.wikimedia.org/r/695255 [13:45:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I haven't checked IP addresses, but the change LGTM otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [13:45:52] (03PS1) 10Mforns: analytics::refinery::job::eventlogging_to_druid: Default to deploy-mode cluster [puppet] - 10https://gerrit.wikimedia.org/r/895228 [13:46:29] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) [13:49:31] Heads up - network maintenance Eqiad row A starting in ~10 mins [13:49:41] Netbox will be unavailable for this time so please avoid running any cookbooks that depend on it until 15:00 UTC (check sre channel for updates) [13:49:49] Also mr1-eqiad will be affected by the work too - so management network in eqiad will be unavailable [13:50:20] !log staging Junos files to individual VC members eqiad row A to prep for upgrade [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:35] !log disabling Puppet in eqiad/esams/drmrs for forthcoming Switch maintenance T329073 [13:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:42] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [13:52:45] 10SRE, 10MediaWiki-File-management, 10Traffic, 10MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), and 2 others: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10Jdforrester-WMF) 05In progress→03Resolved [13:53:38] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39993/console" [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [13:54:03] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: exclude Exemplars from Thanos Query errors [alerts] - 10https://gerrit.wikimedia.org/r/895222 (owner: 10Filippo Giunchedi) [13:54:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1020.eqiad.wmnet with OS bullseye [13:54:44] (03PS2) 10David Caro: wmcs: update ceph alerts dashboard [alerts] - 10https://gerrit.wikimedia.org/r/895220 [13:55:01] !log depool moss-fe1001 T329073 [13:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:34] !log depool ms-fe1009 T329073 [13:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:46] (03CR) 10Jbond: "thanks comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [13:56:00] (03CR) 10Jbond: [C: 03+2] pki: failover to codfw for switch reboot [dns] - 10https://gerrit.wikimedia.org/r/895225 (https://phabricator.wikimedia.org/T329073) (owner: 10Jbond) [13:56:50] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [13:57:20] (03PS1) 10Jbond: Revert "pki: failover to codfw for switch reboot" [dns] - 10https://gerrit.wikimedia.org/r/895194 [13:58:34] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond) [13:58:35] !log depool thanos-fe1001 T329073 [13:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:41] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [13:59:09] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond) 
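The pki failover merged above moves pki.discovery.wmnet to codfw ahead of the row A switch reboot. A heavily hedged sketch of depooling a datacenter from a discovery record with conftool (the discovery object type exists, but the selector fields and boolean values shown here are assumptions, not verified syntax):

    # depool eqiad from the pki discovery record (selector/value syntax is an assumption)
    sudo confctl --object-type discovery select 'dnsdisc=pki,name=eqiad' set/pooled=false
    sudo confctl --object-type discovery select 'dnsdisc=pki' get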
[13:59:09] !log failover pki.discovery.wmnet to codfw T329073 [13:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:37] (03CR) 10FNegri: [C: 03+2] clouddb: depool clouddb[1013-1014] [puppet] - 10https://gerrit.wikimedia.org/r/895217 (https://phabricator.wikimedia.org/T329073) (owner: 10FNegri) [13:59:45] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MatthewVernon) [14:00:05] akosiaris: Dear deployers, time to do the Kubernetes upgrade wikikube eqiad deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0900). [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1400). [14:00:05] sbailey: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1400) [14:00:34] I am here :-) [14:00:42] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Papaul) a:03Jhancock.wm [14:00:56] PROBLEM - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process [14:01:12] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:43] I’m around but in a meeting [14:01:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895184 (owner: 10Volans) [14:02:02] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:08] ACKNOWLEDGEMENT - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Btullis Shut down to avoind new jobs being started during T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process [14:02:20] no rush on the config change [14:02:27] * Lucas_WMDE looks [14:02:38] !log mvernon@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1001" [14:02:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:03:19] looks harmless enough, I think I can deploy it on the side [14:03:31] (03PS1) 10Alexandros Kosiaris: wikikube eqiad: Fix ippool to 10.67.128.0/18 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/895229 (https://phabricator.wikimedia.org/T331126) [14:03:32] thanks [14:05:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase101[69].eqiad.wmnet [14:05:07] I’d also maybe like to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/894599/, if I could get a +1 on that that would be great ;) [14:05:35] Lucas_WMDE: you don't want to upgrade right now [14:05:40] ok [14:05:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase102[18].eqiad.wmnet [14:05:44] it can wait too [14:05:48] multiple maintenances running [14:05:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1031.eqiad.wmnet [14:05:53] but I 'll revie once they are done [14:06:01] is that also about the config change or only the deployment-charts? [14:06:14] it's also about the eqiad row maintenance [14:06:23] akosiaris: thanks [14:06:27] aaah, config change ? which config change? [14:06:35] the one by sbailey [14:06:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895185 (owner: 10Volans) [14:06:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/894733/ [14:06:48] for the backport+config windown [14:06:48] Network maintenance window is open, just waiting on the switches to verify the software, I'll confirm here before hard downtime starts [14:06:56] yeah you probably want to avoid [14:06:59] ack [14:07:00] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 238 hosts with reason: eqiad row A upgrade [14:07:00] PROBLEM - Check systemd state on ms-be2070 is CRITICAL: CRITICAL - degraded: The following units failed: smartmontools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:05] sbailey: not deploying then, sorry [14:07:12] I hope it can wait [14:07:28] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:29] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1037'] [14:07:30] no harm will happen btw, but you run a high risk of becoming frustrated in the process [14:07:35] no problem, looks like a lot going on at the same time, will shoot for next backport windows [14:07:40] and frustrating others too [14:07:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895206 (owner: 10Volans) [14:07:47] that sounds like harm to me ;) [14:07:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) (owner: 10Lucas Werkmeister (WMDE)) [14:07:56] good point ;-) [14:07:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:08:02] uh, +2? not +1? 
[14:08:09] oh dammit [14:08:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) (owner: 10Lucas Werkmeister (WMDE)) [14:08:17] ok ^^ [14:08:21] I'll reschedule for the late backport window if now is a bad time :-) [14:08:21] fixed, thanks [14:08:26] (03CR) 10Volans: "I did a first pass on the code, skipping the test." [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [14:08:27] thanks for the review :) [14:08:28] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:41] sbailey: good luck (I probably won’t be around then but someone else should be) [14:08:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] wikikube eqiad: Fix ippool to 10.67.128.0/18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/895229 (https://phabricator.wikimedia.org/T331126) (owner: 10Alexandros Kosiaris) [14:08:51] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: host reimage [14:08:53] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: host reimage [14:08:59] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1038'] [14:09:18] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1038'] [14:09:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 238 hosts with reason: eqiad row A upgrade [14:09:39] thanks Lucas_WMDE [14:09:44] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1038'] [14:09:58] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f4ffc353-a529-4620-994f-ae7b737f3c7a) set by cmooney@cumin1001 for 2:00:00 on 238 host(s... 
[14:11:38] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:11:38] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:11:38] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:12:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1001" [14:12:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2070.codfw.wmnet with OS bullseye [14:13:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2070.codfw.wmnet with OS bullseye completed: - ms-b... [14:13:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) @Papaul I've fixed the underlying problems and you'll see ms-be2070 reimaged to successful completion now, so hopefully that's you unblo... 
[14:13:55] !log kubectl cordon kubernetes{1005,1007,1008,1017,1018}.eqiad.wmnet T329073 [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:01] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [14:14:10] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:10] PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:12] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:16] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:16] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:26] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:28] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:32] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:34] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:36] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:36] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:36] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:37] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:38] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:39] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:40] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:42] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:46] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:46] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:50] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:50] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:50] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:51] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:54] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:56] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:56] PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:58] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:00] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:00] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:00] PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:06] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:12] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:12] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:13] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:13] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:14] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:16] PROBLEM - Hadoop NodeManager on 
an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:18] (ProbeDown) firing: (3) Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1037'] [14:15:20] PROBLEM - Hadoop NodeManager on an-worker1144 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:30] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:32] PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:34] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:52] hi [14:16:18] Service mathoid:4001 has failed probes (http_mathoid_ip4) [14:16:18] RECOVERY - Check systemd state on ms-be2070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:19] topranks: ignore my helmfile deploys, I am on purpose working on the other rows [14:16:29] bblack: sigh, it expired? ignore [14:16:32] is all the analytics line-noise expected? 
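Ahead of the switch reboot, the row A wikikube nodes were cordoned (see the kubectl cordon !log entry above) so no new pods land on them during the outage. A minimal sketch, with the node names taken from that entry:

    # mark the row A nodes unschedulable; kubectl uncordon reverses this afterwards
    for n in kubernetes{1005,1007,1008,1017,1018}.eqiad.wmnet; do
      kubectl cordon "$n"
    done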
[14:16:34] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mr1-eqiad with reason: eqiad row A upgrade [14:16:43] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:16:43] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:16:43] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:16:44] the analytics ones I can't answer [14:16:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mr1-eqiad with reason: eqiad row A upgrade [14:16:55] well I see some acks anyways [14:17:03] akosiaris: ack, thanks for the detail [14:17:05] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1020.eqiad.wmnet with OS bullseye [14:17:06] akosiaris: ok so mathoid is just an expired downtime? [14:17:09] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0a07bba2-0f50-4eec-9718-0c768add34f3) set by cmooney@cumin1001 for 2:00:00 on 1 host(s)... [14:17:15] bblack: yes [14:17:22] ack [14:17:32] akosiaris: are you reimaging hosts? [14:17:33] (03CR) 10Jbond: [C: 03+1] "lgtm q for offline but not blocking" [cookbooks] - 10https://gerrit.wikimedia.org/r/895207 (owner: 10Volans) [14:17:42] topranks: no [14:17:45] not anymore that is [14:17:50] ok cool [14:18:01] yeah mr1-eqiad is in row A, so mgmt comms will be interrupted [14:18:33] I'll resolve the new incident/page [14:18:35] resolved it [14:18:42] ah, ok [14:19:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1038'] [14:19:53] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:20:38] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:20:53] (03CR) 10Jbond: [C: 03+1] sre.{ganeti,hardware,hosts}: fix mypy issues (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/895207 (owner: 10Volans) [14:20:54] !log issuing reboot to upgrade asw2-a-eqiad virtual-chassis to Junos 21.4 [14:20:55] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:02] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:21:09] jouncebot: now [14:21:09] For the next 1 hour(s) and 38 minute(s): Kubernetes upgrade wikikube eqiad (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T0900) [14:21:09] For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1400) [14:21:10] For the next 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1400) [14:21:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:21:32] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:21:44] PROBLEM - Check systemd state on ms-be2070 is CRITICAL: CRITICAL - degraded: The following units failed: smartmontools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:46] Hi folks, would it be possible to deploy a last-minute config change? [14:22:10] Daimona: there is a network maintenance in eqiad taking down a whole row [14:23:00] so I'd say better wait until that's completed, various hosts wouldn't be reachable anyway [14:23:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895208 (owner: 10Volans) [14:23:32] OK, thanks anyway [14:23:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895209 (owner: 10Volans) [14:24:33] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:24:34] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:24:37] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:24:59] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:25:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:25:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895210 (owner: 10Volans) [14:25:54] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:26:06] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:07] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:26:16] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:26:28] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:48] PROBLEM - Host ps1-f1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:50] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:52] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:54] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:54] PROBLEM - Host ps1-f3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:06] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:08] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:08] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:16] PROBLEM - Host ps1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:30] (virtual-chassis crash) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:27:30] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 4 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:27:48] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:10] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:22] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:24] ^ expected [14:28:25] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsEnableMultipleOrganizers on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895234 (https://phabricator.wikimedia.org/T327470) [14:28:28] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:38] PROBLEM - MariaDB Replica IO: s5 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:40] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:42] (03PS1) 10Bking: search airflow: configure for postgres [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) [14:28:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 207, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:48] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:28:50] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:28:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 192, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:58] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:29:04] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:29:12] PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [14:29:17] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:29:18] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:29:18] PROBLEM - MariaDB Replica IO: es5 on es1023 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es1024.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es1024.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:20] PROBLEM - MariaDB Replica IO: es5 on es1025 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es1024.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es1024.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:23] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:29:24] PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:27] !log akosiaris@deploy1002 helmfile [eqiad] 
START helmfile.d/admin 'apply'. [14:29:30] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry1004.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Docker [14:29:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [14:29:48] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:56] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:56] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy [14:30:15] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+1] "LGTM just waiting for test" [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [14:30:16] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:17] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:18] (ProbeDown) firing: (4) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:33] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:34] that one ^ is due to the row maint [14:30:38] ! 
incidents [14:30:51] !incidents [14:30:52] 3453 (ACKED) [4x] ProbeDown (ip4 probes/service eqiad) [14:30:52] 3452 (RESOLVED) [3x] ProbeDown (ip4 probes/service eqiad) [14:30:52] 3451 (RESOLVED) [2x] ProbeDown (ip4 probes/service eqiad) [14:30:52] 3450 (RESOLVED) ProbeDown (ip4 probes/service eqiad) [14:30:52] 3449 (RESOLVED) [7x] ProbeDown (ip4 probes/service eqiad) [14:30:53] 3446 (RESOLVED) PHPFPMTooBusy parsoid (php7.4-fpm.service codfw) [14:31:01] ok [14:31:05] !resolve 3453 [14:31:06] 3453 (ACKED) [4x] ProbeDown (ip4 probes/service eqiad) [14:31:41] we need some more detail in those descriptions at some point [14:32:18] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-eqiad.service,fetch-rings-thanos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:28] switches are starting to come back online [14:32:38] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:32:43] \o/ [14:33:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895211 (owner: 10Volans) [14:33:20] RECOVERY - Check systemd state on ms-be2070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:45] (JobUnavailable) firing: (24) Reduced availability for job burrow in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:33:46] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/895212 (owner: 10Volans) [14:34:30] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895213 (owner: 10Volans) [14:34:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:34:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:34:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:34:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:34:48] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, 
number_of_in_flig [14:34:48] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:34:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895214 (owner: 10Volans) [14:35:06] (Wikidata Reliability Metrics - Median Payload alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [14:35:09] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [14:35:10] RECOVERY - MariaDB Replica IO: es5 on es1025 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:35:12] RECOVERY - Host asw2-a-eqiad is UP: PING WARNING - Packet loss = 60%, RTA = 0.85 ms [14:35:14] RECOVERY - Host ps1-e2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [14:35:14] RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [14:35:14] RECOVERY - Host ps1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [14:35:14] RECOVERY - Host ps1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [14:35:16] RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [14:35:16] RECOVERY - Host ps1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [14:35:18] RECOVERY - Host ps1-f3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [14:35:20] RECOVERY - Host ps1-f1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [14:35:22] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:35:22] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Docker [14:35:28] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [14:35:36] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:35:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895215 (owner: 10Volans) [14:35:38] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:40] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms [14:35:48] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:00] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:08] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:16] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [14:36:16] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:25] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:36:28] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:30] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 176, active_shards: 352, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [14:36:30] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:36:36] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:36:38] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:44] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [14:36:46] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:36:47] (03CR) 10Nicolas Fraison: [C: 03+1] sre.hadoop: do not override API method [cookbooks] - 10https://gerrit.wikimedia.org/r/895206 (owner: 10Volans) [14:37:02] RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [14:37:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/895216 (owner: 10Volans) [14:37:16] RECOVERY - MariaDB Replica IO: s8 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:37:46] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:06] RECOVERY - MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:07] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:38:18] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:38:24] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:38:26] RECOVERY - MariaDB Replica IO: s5 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:26] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:38:32] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [14:38:35] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [14:38:46] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:38:50] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:39:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [14:40:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:42] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:40:46] (03Abandoned) 10Jbond: Revert "pki: failover to codfw for switch reboot" [dns] - 10https://gerrit.wikimedia.org/r/895194 (owner: 10Jbond) [14:40:56] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:40:57] !log enabling Puppet in eqiad/esams/drmrs after completed Switch maintenance T329073 [14:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:05] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:41:05] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [14:41:10] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:41:18] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:41:44] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:42:01] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for mr1-eqiad [14:42:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mr1-eqiad [14:42:02] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond) [14:42:14] PROBLEM - MariaDB Replica Lag: es5 on es1023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1126.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:42:18] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 238 hosts [14:42:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [14:42:52] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:42:56] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:42:56] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[14:43:02] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:02] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:04] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:43:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:43:14] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:16] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:43:18] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:18] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:43:18] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:22] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:28] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:43:30] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:30] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:30] PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 238 hosts [14:43:36] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:37] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:43:47] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:47] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:50] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:54] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:54] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:54] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:58] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:44:00] PROBLEM - Bird Internet Routing Daemon on dns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:44:00] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:02] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1001 is 
CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [14:44:02] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:44:06] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:44:08] PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:08] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 389 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1170, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 389, delayed_unassigned_shards: 0, number_of_pending [14:44:08] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 75.04810776138551 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:44:10] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:14] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:16] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:16] PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:20] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:28] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:28] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:28] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:30] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:30] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:30] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:34] RECOVERY - Hadoop ResourceManager on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process [14:44:48] (ProbeDown) firing: (37) Service irc1001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:06] !log uncordon kubernetes{1005,1007,1008,1017,1018}.eqiad.wmnet [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:17] !log uncordon kubernetes{1005,1007,1008,1017,1018}.eqiad.wmnet T331126 [14:45:18] PROBLEM - Check systemd state on registry1003 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:22] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:45:22] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:23] T331126: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 [14:45:24] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-discovery.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:26] (virtual-chassis crash) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:45:55] (JobUnavailable) resolved: (117) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:08] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:11] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:46:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:46:20] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:46:21] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:46:30] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:46:47] (KafkaUnderReplicatedPartitions) resolved: (2) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:46:51] (Wikidata Reliability Metrics - Median Payload alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [14:46:56] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:46:57] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [14:47:00] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:06] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:06] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:47:16] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1021.eqiad.wmnet with OS bullseye [14:47:28] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:37] (ProbeDown) resolved: (39) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:38] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:42] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:05] (03PS1) 10JMeybohm: secrets/ssl: Remove keys for kubernetes etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/895237 (https://phabricator.wikimedia.org/T329717) [14:48:06] (ProbeDown) resolved: (38) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:10] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:32] (03CR) 10CDanis: "This looks to me like it would page several times daily, in the past few days at least:" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [14:48:32] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:48:57] any known cause for the commons alert? 
[14:49:12] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:25] (03PS12) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [14:49:30] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:32] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:35] oh sorry I didn't read the "resolved" part :) [14:50:09] (03PS1) 10Ssingh: Revert "hiera: temporarily removed dns1001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/895272 [14:50:15] I didn't see it fire, either, but maybe timing issues [14:50:18] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) Happy to say the upgrade went as expected, no issues encountered. All devices now back online running 21.4R3-S1.5. [14:50:32] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:50:34] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:51:14] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:51:32] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:51:34] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1293, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 262, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_ [14:51:34] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.93778062860808 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:51:38] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1300, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 
256, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_ [14:51:38] t_fetch: 0, task_max_waiting_in_queue_millis: 2072, active_shards_percent_as_number: 83.38678640153945 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:51:38] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1302, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [14:51:38] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.51507376523412 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:51:46] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:51:50] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1317, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 238, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [14:51:50] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.47722899294419 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:51:54] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1323, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 232, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_ [14:51:54] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.86209108402822 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:52:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:52:28] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 779, active_shards: 1369, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 186, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_ [14:52:28] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.81270044900577 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:52:30] !log bking@cumin2002 unban row A cloudelastic nodes T329073 [14:52:31] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Andrew) the following hosts paged during this maintenance: ` NodeDown wmcs cloudvirt1023:9100 (node 
eqiad) NodeDown wmcs cloudvirt1024:9100 (node eqiad... [14:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:38] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [14:53:00] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:53:04] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:53:13] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: sync [14:53:18] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:42] RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:53:58] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync [14:54:10] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:18] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:54:19] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [14:54:20] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:30] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:54:37] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [14:54:42] RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:54:47] !log T331126 toolhub deployed, https://toolhub.wikimedia.org/ operational again [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] T331126: Update wikikube eqiad to k8s 1.23 - https://phabricator.wikimedia.org/T331126 [14:55:16] RECOVERY - Bird Internet Routing Daemon on dns1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:55:18] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:55:37] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: sync [14:55:40] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:56:13] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: sync [14:56:14] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [14:56:39] !log bking@cumin2002 unban production row A elastic nodes from all clusters T329073 [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [14:56:45] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: sync [14:56:46] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:56:58] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: sync [14:56:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:57:26] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:57:27] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [14:57:32] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:57:37] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [14:57:38] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: sync [14:57:42] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily removed dns1001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/895272 (owner: 10Ssingh) [14:58:01] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: sync [14:58:02] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: sync [14:58:22] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:23] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:26] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: sync [14:58:28] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [14:58:30] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1038'] [14:58:32] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:40] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:46] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:46] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1031.eqiad.wmnet [14:58:50] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase102[18].eqiad.wmnet [14:58:56] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [14:59:00] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase101[69].eqiad.wmnet [14:59:30] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:33] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [14:59:33] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:59:34] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: sync [14:59:37] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:39] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:41] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:43] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:45] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with 
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:45] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:47] (03PS7) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [14:59:49] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: sync [14:59:49] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:49] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:50] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: sync [14:59:51] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:53] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:57] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:03] RECOVERY - Hadoop NodeManager on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:05] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: sync [15:00:06] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [15:00:07] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:09] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:09] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:13] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:14] RECOVERY - 
Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:17] RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:19] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:21] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:32] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [15:00:33] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [15:00:43] RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:45] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:45] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:51] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [15:00:52] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [15:00:58] (03PS1) 10Btullis: Reenable the gobblin timers after switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/895239 (https://phabricator.wikimedia.org/T329073) [15:00:59] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:02] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [15:01:03] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [15:01:16] !log repooling dns1001: authdns-update can now be run again [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:31] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [15:01:32] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [15:01:40] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1021.eqiad.wmnet with reason: host reimage [15:01:58] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [15:01:59] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:01:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: sync [15:02:11] PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:02:14] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: sync [15:02:15] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/SERVICE_NAME: sync [15:02:16] (03PS1) 10FNegri: Revert "clouddb: depool clouddb[1013-1014]" [puppet] - 10https://gerrit.wikimedia.org/r/895286 [15:02:16] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/SERVICE_NAME: sync [15:02:17] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: sync [15:02:25] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:02:42] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: sync [15:02:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: sync [15:03:15] (03PS1) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [15:03:17] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:17] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:29] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:47] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:47] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:51] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:53] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:10] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync 
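The long run of deploy1002 !log entries above is one helmfile sync per service under helmfile.d/services/, apparently re-applying every chart in eqiad after the wikikube 1.23 work (T331126). As a minimal sketch of a single such sync (the /srv/deployment-charts checkout path and the use of plain upstream helmfile, rather than any local wrapper, are assumptions, not something the log states):

    # re-sync one service's release in eqiad; 'linkrecommendation' is just an example name taken from the log
    cd /srv/deployment-charts/helmfile.d/services/linkrecommendation   # assumed checkout location on the deploy host
    helmfile -e eqiad sync                                             # the same START/DONE action the !log lines record

Looping those two commands over the service directories would reproduce the sequence of syncs logged here.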
[15:04:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: sync [15:04:20] (03PS8) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [15:04:33] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:04:37] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:38] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: sync [15:04:39] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: sync [15:04:39] !log dns1001 - restarted prometheus-bird-exporter [15:04:44] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1021.eqiad.wmnet with reason: host reimage [15:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:09] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:57] PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:59] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:06:01] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:01] PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:01] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:11] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:17] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:19] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync [15:06:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [15:06:43] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:45] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:47] RECOVERY - Hadoop NodeManager on analytics1064 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:51] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync [15:06:52] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum1001.eqiad.wmnet with OS bullseye [15:06:52] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync [15:07:01] PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum1001.eqiad.wmnet with OS bullseye [15:07:05] PROBLEM - Check systemd state on an-worker1079 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:07] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:07:09] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:19] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:37] PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:51] (03CR) 10Ottomata: [C: 03+1] "haven't looked at PCC, but hiera LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [15:08:02] (03CR) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [15:08:05] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:08:47] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:08:47] PROBLEM - Hadoop NodeManager on an-worker1109 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:09:23] PROBLEM - Check whether ferm is active by checking the default input chain on kubemaster1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1038'] [15:09:30] (03PS4) 10Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 [15:09:45] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:09:45] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:09:53] (03CR) 10Stevemunene: search airflow: configure for postgres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [15:10:02] <_joe_> akosiaris: is kubemaster1001 you? [15:10:23] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:31] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:10:44] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) >>! In T329073#8672931, @Andrew wrote: > the following hosts paged during this maintenance: > > > ` > NodeDown wmcs cloudvirt1023:9100 (node e... [15:10:49] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:10:49] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:11:09] (03CR) 10CDanis: "This looks reasonable to me. 
LGTM once joe is happy" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [15:11:09] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:11:11] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync [15:11:12] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1039'] [15:11:12] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync [15:11:27] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync [15:11:28] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync [15:11:45] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:49] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:03] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:47] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:13:13] (03CR) 10Jbond: "this never got merged but i think it should be good to add. 
shout if no, otherwise ill merge tomorrow 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [15:13:17] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:22] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff) [15:13:39] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:39] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:13:39] PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:03] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:07] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:07] PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:07] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:07] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:09] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:09] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:11] PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:17] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:19] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:19] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:19] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:21] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:21] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:23] PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:23] PROBLEM - Check systemd state on analytics1064 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:25] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:25] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:31] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:35] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:39] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:41] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:43] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:43] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:43] PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:43] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:45] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:47] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:47] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:47] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:48] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:49] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:50] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:51] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:52] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:53] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:54] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:55] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:56] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:57] PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:58] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:59] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:00] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:01] PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:02] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:03] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:04] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:05] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:06] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:07] PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:08] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:09] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:10] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:11] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:12] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:15] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:17] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:17] PROBLEM - Check systemd state on an-worker1094 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:17] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:17] PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:19] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:19] PROBLEM - Check systemd state on analytics1065 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:21] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:21] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:23] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:23] PROBLEM - Check systemd state on an-worker1092 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:25] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:25] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync [15:15:25] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:26] !log akosiaris@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [15:15:26] !log akosiaris@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [15:15:33] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:39] !log akosiaris@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [15:15:40] !log akosiaris@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:15:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync [15:15:49] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:15:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:05] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:16:13] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:16:14] !log pool moss-fe1001 T329073 [15:16:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:16:19] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:16:20] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 [15:16:39] !log pool ms-fe1009 T329073 [15:16:39] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:39] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [15:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:55] RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:01] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:01] PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [15:17:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [15:17:33] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:35] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:35] PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:53] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:01] RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:18:05] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:18:25] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:35] RECOVERY - Hadoop NodeManager on an-worker1121 is 
OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:35] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: sync [15:18:37] RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:39] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:39] RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:39] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:41] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:41] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:41] RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:42] RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:43] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:45] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:45] RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:46] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:18:47] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:19:10] !log mvernon@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web [15:19:16] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [15:19:17] !log akosiaris@deploy1002 helmfile [eqiad] START 
helmfile.d/services/wikifeeds: sync [15:19:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [15:19:25] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney) [15:19:30] !log pool thanos-fe1001 T329073 [15:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [15:19:44] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [15:19:45] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: sync [15:20:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [15:20:19] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: sync [15:20:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [15:20:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1039'] [15:20:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45198 and previous config saved to /var/cache/conftool/dbconfig/20230307-152037-marostegui.json [15:20:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: sync [15:20:42] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: sync [15:20:44] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:20:56] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: sync [15:20:56] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1039'] [15:20:57] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [15:21:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1039'] [15:21:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: sync [15:21:30] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: sync [15:21:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [15:21:51] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1021.eqiad.wmnet with OS bullseye [15:21:59] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:22:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/similar-users: sync [15:22:12] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: sync [15:22:13] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [15:22:37] (03CR) 10Nicolas Fraison: [C: 03+1] Reenable the gobblin timers after switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/895239 (https://phabricator.wikimedia.org/T329073) (owner: 10Btullis) [15:22:39] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/similar-users: 
sync [15:22:46] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [15:22:47] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [15:22:49] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:22:51] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:22:53] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:22:56] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [15:22:57] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: sync [15:23:15] (03CR) 10Btullis: [C: 03+2] Reenable the gobblin timers after switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/895239 (https://phabricator.wikimedia.org/T329073) (owner: 10Btullis) [15:23:17] (03CR) 10Volans: [C: 03+1] "LGTM, one question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [15:23:41] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:43] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: sync [15:23:44] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: sync [15:23:47] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:23:56] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [15:23:57] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: sync [15:23:58] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: sync [15:24:13] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync [15:24:14] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: sync [15:24:30] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: sync [15:24:31] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: sync [15:24:47] dcausse: rdf-streaming updater got redeployed a few mins ago [15:24:51] I think you should be good to go [15:24:56] akosiaris: thanks! 
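[Annotation] The long run of "helmfile [eqiad] START/DONE helmfile.d/services/<service>: sync" entries above is a per-service redeploy of everything under helmfile.d/services/, here being replayed after the eqiad switch maintenance. A minimal sketch of what one of those syncs looks like from the deployment host; the path and flags are assumptions based on the standard deployment-charts layout, not copied from this session:

    cd /srv/deployment-charts/helmfile.d/services/zotero   # path is an assumption
    helmfile -e eqiad sync       # apply the eqiad release, as the bot logs above
    helmfile -e eqiad status     # optionally confirm the release state afterwards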
[15:24:58] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: sync [15:24:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/similar-users: sync [15:25:01] (03CR) 10Muehlenhoff: mod_auth_cas: add logout script for mod_auth_cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [15:25:01] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/similar-users: sync [15:25:02] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [15:25:05] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [15:25:06] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: sync [15:25:09] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: sync [15:25:10] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [15:25:13] RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:19] dcausse: IIRC we want to wait it to catch up before we repool wdqs, right ? [15:25:30] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/895135 (https://phabricator.wikimedia.org/T331345) (owner: 10Ottomata) [15:25:34] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. [15:25:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [15:25:43] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45199 and previous config saved to /var/cache/conftool/dbconfig/20230307-152545-marostegui.json [15:25:51] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:25:56] akosiaris: yes, but I think inflatador can take care of the repool once the lag is back to normal [15:26:06] ah, awesome. Thanks for that! 
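[Annotation] The wdqs repool discussed here happens later in the log (16:58, "set/pooled=true; selector: dnsdisc=wdqs,name=eqiad"). A hedged sketch of that conftool step, assuming the usual confctl selector syntax; the exact invocation used is not shown in the log:

    # inspect the discovery object first, then repool eqiad once update lag has recovered
    confctl select 'dnsdisc=wdqs,name=eqiad' get
    confctl select 'dnsdisc=wdqs,name=eqiad' set/pooled=true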
[15:26:14] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:26:16] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: sync [15:26:18] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync [15:26:19] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [15:26:22] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [15:26:23] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: sync [15:26:23] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir5001.eqsin.wmnet with OS bullseye [15:26:26] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [15:26:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bullseye [15:26:47] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:51] RECOVERY - Hadoop NodeManager on an-worker1144 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:26:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [15:27:17] (03CR) 10Volans: "LGTM, just one wording nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [15:27:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [15:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T328817)', diff saved to https://phabricator.wikimedia.org/P45200 and previous config saved to /var/cache/conftool/dbconfig/20230307-152729-marostegui.json [15:27:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:27:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:28:27] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:28:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) ...the icinga warning was systemd timing out waiting for smartd to start up (takes about 2 minutes). 
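[Annotation] The fix MVernon pushes a couple of minutes later (gerrit 895309, "smart: override unit to make systemd wait longer") is a systemd drop-in that raises the unit's start timeout so smartd's slow startup no longer trips the systemd-state check. A minimal sketch of that kind of override; the unit name and the five-minute value are assumptions, not taken from the actual patch:

    # create a drop-in for the smartd unit and raise its start timeout
    sudo systemctl edit smartd.service
    # in the editor, add:
    #   [Service]
    #   TimeoutStartSec=5min
    sudo systemctl daemon-reload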
[15:28:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:28:54] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1022.eqiad.wmnet with OS bullseye [15:28:55] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:29:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:29:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:29:13] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:29:21] !log installing libde265 security updates [15:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:05] (03PS1) 10MVernon: smart: override unit to make systemd wait longer [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) [15:30:07] (03PS1) 10Jelto: gitlab: production host needs additional flag for restore [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) [15:30:28] (03CR) 10CI reject: [V: 04-1] smart: override unit to make systemd wait longer [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [15:30:45] RECOVERY - MariaDB Replica IO: es5 on es1023 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:30:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bullseye [15:31:07] RECOVERY - MariaDB Replica Lag: es5 on es1023 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:08] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum1001.eqiad.wmnet with OS bullseye completed: - durum1001 (**PASS**) - Downtimed on Icinga... 
[15:31:15] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis) [15:31:39] (03PS2) 10MVernon: smart: override unit to make systemd wait longer [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) [15:32:34] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [15:32:53] RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:58] (KubernetesCalicoDown) firing: (2) kubernetes1021.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:34:15] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:25] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:34:43] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1040'] [15:35:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:35:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:35:14] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:36:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.621 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:36:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.590 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:36:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:36:56] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum1002.eqiad.wmnet with OS bullseye [15:37:07] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum1002.eqiad.wmnet with OS bullseye [15:37:49] RECOVERY - Check systemd state on ms-be1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:53] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:38:13] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1044 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:38:40] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:38:55] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:15] (JobUnavailable) firing: (5) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:35] RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:39:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:39:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:40:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:40:31] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T328817)', diff saved to https://phabricator.wikimedia.org/P45201 and previous config saved to /var/cache/conftool/dbconfig/20230307-154034-marostegui.json [15:40:43] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T329203)', diff saved to https://phabricator.wikimedia.org/P45202 and previous config saved to /var/cache/conftool/dbconfig/20230307-154049-marostegui.json [15:40:56] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:40:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P45203 and previous config saved to /var/cache/conftool/dbconfig/20230307-154058-marostegui.json [15:41:11] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:41:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:41:59] !log 
akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1022.eqiad.wmnet with reason: host reimage [15:42:01] ACKNOWLEDGEMENT - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service MVernon being worked on - T327253 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1040'] [15:42:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:42:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:43:03] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (5) wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:44:15] (JobUnavailable) resolved: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:35] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1022.eqiad.wmnet with reason: host reimage [15:44:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:45:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [15:46:24] (03CR) 10Vgutierrez: [C: 03+1] sre.loadbalancer.restart-pybal: fix mypy issues [cookbooks] - 10https://gerrit.wikimedia.org/r/895212 (owner: 10Volans) [15:46:34] (03CR) 10Herron: "Seems fine to auto restart this service to keep it up to date, but overall I don't think we actually need it running. 
I'm not aware of an" [puppet] - 10https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:46:37] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1040'] [15:46:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [15:47:37] (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/895313 (https://phabricator.wikimedia.org/T330165) [15:47:44] (03CR) 10MVernon: smart: override unit to make systemd wait longer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [15:47:48] (03CR) 10MVernon: [C: 03+2] smart: override unit to make systemd wait longer [puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [15:47:49] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (7) wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:48:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:49:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:49:25] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage [15:49:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [15:50:33] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 12 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:50:36] (03CR) 10Marostegui: [C: 03+1] Revert "clouddb: depool clouddb[1013-1014]" [puppet] - 10https://gerrit.wikimedia.org/r/895286 (owner: 10FNegri) [15:50:42] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:51:26] (03CR) 10Nicolas Fraison: "Looks to work well: https://grafana.wikimedia.org/d/000000379/hive?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analy" [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [15:51:37] (03CR) 10FNegri: [C: 03+2] Revert "clouddb: depool clouddb[1013-1014]" [puppet] - 10https://gerrit.wikimedia.org/r/895286 (owner: 10FNegri) [15:51:56] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) 
@cmooney I update the table with lengths between all the racks. [15:52:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5001.eqsin.wmnet with reason: host reimage [15:53:09] (03CR) 10Elukey: [C: 03+1] "nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [15:53:23] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:53:27] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/895313 (https://phabricator.wikimedia.org/T330165) (owner: 10Marostegui) [15:53:39] !log Failover m1-master T330165 [15:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:45] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [15:54:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T329203)', diff saved to https://phabricator.wikimedia.org/P45204 and previous config saved to /var/cache/conftool/dbconfig/20230307-155428-marostegui.json [15:54:35] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:54:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:55:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45205 and previous config saved to /var/cache/conftool/dbconfig/20230307-155541-marostegui.json [15:56:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P45206 and previous config saved to /var/cache/conftool/dbconfig/20230307-155604-marostegui.json [15:56:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1040'] [15:56:55] (03PS1) 10Btullis: Add a dummy keytab for an-airflow1005 [labs/private] - 10https://gerrit.wikimedia.org/r/895317 (https://phabricator.wikimedia.org/T327970) [15:57:29] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a dummy keytab for an-airflow1005 [labs/private] - 10https://gerrit.wikimedia.org/r/895317 (https://phabricator.wikimedia.org/T327970) (owner: 10Btullis) [15:58:01] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 147.9 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:58:05] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [15:58:15] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:58:27] RECOVERY - Check systemd state on registry1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 148 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:59:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:00:09] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:01:09] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:01:27] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1022.eqiad.wmnet with OS bullseye [16:02:10] (03PS2) 10Bking: search airflow: configure for postgres [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) [16:02:29] (03CR) 10Ottomata: search-airflow: add analytics sql replica creds (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [16:03:12] (03CR) 10Stevemunene: [C: 03+1] search airflow: configure for postgres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [16:03:14] (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-single: add ability to enable host on reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [16:03:58] (KubernetesCalicoDown) resolved: kubernetes1022.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1022.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:04:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bullseye [16:04:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum1002.eqiad.wmnet with OS bullseye completed: - durum1002 (**PASS**) - Downtimed on Icinga... 
[16:04:57] (03CR) 10Muehlenhoff: logstash: Enable profile::auto_restarts::service for apache2-htcacheclean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:05:09] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [16:05:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [16:06:36] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite) [16:06:58] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [16:08:25] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1037'] [16:09:09] (03CR) 10Btullis: [C: 03+1] search airflow: configure for postgres [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [16:09:15] (03CR) 10Bking: [C: 03+2] search airflow: configure for postgres [puppet] - 10https://gerrit.wikimedia.org/r/895235 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [16:09:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P45207 and previous config saved to /var/cache/conftool/dbconfig/20230307-160935-marostegui.json [16:10:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45208 and previous config saved to /var/cache/conftool/dbconfig/20230307-161047-marostegui.json [16:11:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45209 and previous config saved to /var/cache/conftool/dbconfig/20230307-161111-marostegui.json [16:11:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:11:21] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:11:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:11:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45210 and previous config saved to /var/cache/conftool/dbconfig/20230307-161132-marostegui.json [16:11:42] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) @MatthewVernon thank you I will try on ms-be2071 and let you know [16:12:47] (03CR) 10MVernon: [C: 03+2] smart: override unit to make systemd wait longer (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/895309 (https://phabricator.wikimedia.org/T326352) (owner: 10MVernon) [16:14:06] (03CR) 10Jbond: [C: 04-1] "going to -1 this for now as per Moritz's comments. it might be useful but at the same time its extra code etc" [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [16:15:37] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:16:14] (03CR) 10Muehlenhoff: mod_auth_cas: add logout script for mod_auth_cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [16:16:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45211 and previous config saved to /var/cache/conftool/dbconfig/20230307-161634-marostegui.json [16:16:41] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:17:07] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1037'] [16:17:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:17:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:18:03] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (7) wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:20:15] 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10jbond) 05Open→03Resolved reboot singloe cookbook now updated [16:21:53] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir5001.eqsin.wmnet with OS bullseye [16:22:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bullseye completed: - ncredir5001 (**PASS**) - Downtimed on Ic... 
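[Annotation] The durum and ncredir reimages above all follow the same pattern: a reimage cookbook run from a cumin host, which downtimes the host in Icinga, reinstalls bullseye, and runs the first puppet agent before reporting PASS. A rough sketch of such an invocation; the option names (--os, -t) are assumptions from memory of the cookbooks' help output, not copied from these runs:

    # run from a cumin host; downtime, reinstall and first puppet run are handled by the cookbook
    sudo cookbook sre.ganeti.reimage --os bullseye -t T321309 ncredir5002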
[16:23:58] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=kubesvc,name=kubernetes2016.codfw.wmnet [16:24:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:24:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:24:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P45212 and previous config saved to /var/cache/conftool/dbconfig/20230307-162442-marostegui.json [16:25:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2071.codfw.wmnet with OS bullseye [16:25:08] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum2001.codfw.wmnet with OS bullseye [16:25:13] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye [16:25:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum2001.codfw.wmnet with OS bullseye [16:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T328817)', diff saved to https://phabricator.wikimedia.org/P45213 and previous config saved to /var/cache/conftool/dbconfig/20230307-162554-marostegui.json [16:25:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:26:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:26:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:26:10] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. 
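[Annotation] The steady stream of marostegui dbctl entries is one schema-change loop per replica: downtime the host, depool it, apply the change, then repool it in steps. A hedged sketch of the dbctl side of one iteration; the real runs are driven by an automation wrapper, and the pooling percentage shown here is an assumption:

    dbctl instance db2110 depool
    dbctl config commit -m "Depooling db2110 (T328817)"
    # ... apply the schema change on db2110 ...
    dbctl instance db2110 pool -p 25          # then higher percentages in later commits
    dbctl config commit -m "Repooling after maintenance db2110 (T328817)"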
[16:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T328817)', diff saved to https://phabricator.wikimedia.org/P45214 and previous config saved to /var/cache/conftool/dbconfig/20230307-162616-marostegui.json [16:28:00] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:29:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P45215 and previous config saved to /var/cache/conftool/dbconfig/20230307-163140-marostegui.json [16:32:36] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:32:36] RECOVERY - Checks that the airflow database for airflow search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:35:08] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:36:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [16:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T328817)', diff saved to https://phabricator.wikimedia.org/P45216 and previous config saved to /var/cache/conftool/dbconfig/20230307-163813-marostegui.json [16:38:20] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:39:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T329203)', diff saved to https://phabricator.wikimedia.org/P45217 and previous config saved to /var/cache/conftool/dbconfig/20230307-163948-marostegui.json [16:39:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:39:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [16:39:55] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:40:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T329203)', diff saved to https://phabricator.wikimedia.org/P45218 and previous config saved to /var/cache/conftool/dbconfig/20230307-164010-marostegui.json [16:43:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:28] (03PS9) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [16:44:52] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [16:45:51] (03PS5) 10Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 [16:46:01] (03CR) 10Jbond: sre.SREBatchRunner: add max failed argument (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [16:46:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P45219 and previous config saved to /var/cache/conftool/dbconfig/20230307-164647-marostegui.json [16:47:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2071.codfw.wmnet with reason: host reimage [16:48:01] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:48:47] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 179, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:48:59] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [16:49:21] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2071.codfw.wmnet with reason: host reimage [16:52:13] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10thcipriani) [16:52:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS bullseye [16:52:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum2001.codfw.wmnet with OS bullseye completed: - durum2001 (**PASS**) - Downtimed on Icinga... 
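[Annotation] an-worker1078 has been flapping between WriteThrough and WriteBack all day (see the MegaRAID alerts above and below); that pattern usually means the controller keeps falling back to WriteThrough, often because of a battery/BBU issue. A hedged sketch of the MegaCli checks the linked runbook covers; binary name and exact flags are assumptions, so verify against the wikitech MegaCli page before acting:

    megacli -LDGetProp -Cache -LAll -aAll     # show the current write-cache policy per logical drive
    megacli -AdpBbuCmd -GetBbuStatus -aAll    # a flapping policy often points at the BBU
    megacli -LDSetProp WB -LAll -aAll         # request WriteBack again on all logical drives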
[16:52:46] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45220 and previous config saved to /var/cache/conftool/dbconfig/20230307-165319-marostegui.json [16:53:23] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@9924c93]: (no justification provided) [16:53:27] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:53:35] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@9924c93]: (no justification provided) (duration: 00m 11s) [16:53:39] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum2002.codfw.wmnet with OS bullseye [16:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T329203)', diff saved to https://phabricator.wikimedia.org/P45221 and previous config saved to /var/cache/conftool/dbconfig/20230307-165340-marostegui.json [16:53:51] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:53:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum2002.codfw.wmnet with OS bullseye [16:57:00] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum3001.esams.wmnet with OS bullseye [16:57:12] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum3001.esams.wmnet with OS bullseye [16:57:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:20] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad [16:58:36] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:00] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:00:05] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1700) [17:00:06] No Gerrit patches in the queue for this window AFAICS. 
[17:00:38] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:00:42] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:54] PROBLEM - Check systemd state on restbase2019 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:32] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45222 and previous config saved to /var/cache/conftool/dbconfig/20230307-170154-marostegui.json [17:01:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:02:01] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:02:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T329260)', diff saved to https://phabricator.wikimedia.org/P45223 and previous config saved to /var/cache/conftool/dbconfig/20230307-170215-marostegui.json [17:02:56] PROBLEM - cassandra-a CQL 10.192.16.98:9042 on restbase2019 is CRITICAL: connect to address 10.192.16.98 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T329260)', diff saved to https://phabricator.wikimedia.org/P45224 and previous config saved to /var/cache/conftool/dbconfig/20230307-170328-marostegui.json [17:03:51] (03CR) 10Jbond: [C: 03+2] sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [17:04:14] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:04:56] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:05:28] (03PS5) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [17:05:31] (03PS1) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [17:05:44] (03Merged) 10jenkins-bot: sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [17:06:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) @MatthewVernon looks like ms-be2071 is happy second reboot got the server back into the OS so just waiting for it to finish now. 
[17:06:34] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:06:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [17:08:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45225 and previous config saved to /var/cache/conftool/dbconfig/20230307-170826-marostegui.json [17:08:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45226 and previous config saved to /var/cache/conftool/dbconfig/20230307-170848-marostegui.json [17:09:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [17:11:21] RECOVERY - cassandra-a CQL 10.192.16.98:9042 on restbase2019 is OK: TCP OK - 0.032 second response time on 10.192.16.98 port 9042 https://phabricator.wikimedia.org/T93886 [17:12:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3001.esams.wmnet with reason: host reimage [17:15:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3001.esams.wmnet with reason: host reimage [17:18:02] (03CR) 10Volans: [C: 03+2] sre.network: fix minor bugs and type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895185 (owner: 10Volans) [17:18:32] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P45227 and previous config saved to /var/cache/conftool/dbconfig/20230307-171834-marostegui.json [17:18:40] RECOVERY - Check systemd state on restbase2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:49] jouncebot: now [17:18:49] For the next 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1700) [17:19:20] (03CR) 10Joal: [C: 04-1] analytics::refinery::job::eventlogging_to_druid: Default to deploy-mode cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [17:19:56] (03CR) 10Volans: [C: 03+2] sre.loadbalancer.restart-pybal: simplify call [cookbooks] - 10https://gerrit.wikimedia.org/r/895184 (owner: 10Volans) [17:21:02] 10SRE, 10ops-esams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [17:21:34] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir5002.eqsin.wmnet [17:22:04] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir5002.eqsin.wmnet with OS bullseye [17:22:06] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:22:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bullseye [17:22:33] (03CR) 10Ahmon Dancy: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 
(owner: 10Jaime Nuche) [17:23:16] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:23:26] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:23:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T328817)', diff saved to https://phabricator.wikimedia.org/P45228 and previous config saved to /var/cache/conftool/dbconfig/20230307-172333-marostegui.json [17:23:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance [17:23:40] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:23:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bullseye [17:23:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance [17:23:54] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum2002.codfw.wmnet with OS bullseye completed: - durum2002 (**PASS**) - Downtimed on Icinga... [17:23:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T328817)', diff saved to https://phabricator.wikimedia.org/P45230 and previous config saved to /var/cache/conftool/dbconfig/20230307-172354-marostegui.json [17:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45229 and previous config saved to /var/cache/conftool/dbconfig/20230307-172354-marostegui.json [17:24:19] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10lwatson) [17:24:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 179, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:25:03] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) @MatthewVernon puppet is failing with the error below on ms-be2071 ` Error: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/... 
[17:25:18] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10lwatson) 05Stalled→03Open [17:25:50] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:30] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:29:17] jouncebot nowandnext [17:29:17] For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1700) [17:29:17] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1800) [17:29:45] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:29:47] (03Abandoned) 10AOkoth: vrts: mask/unmask services on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/894086 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [17:30:39] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:31:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum3001.esams.wmnet with OS bullseye [17:31:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum3001.esams.wmnet with OS bullseye completed: - durum3001 (**PASS**) - Downtimed on Icinga... [17:32:05] jouncebot: nowandnext [17:32:05] For the next 0 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1700) [17:32:05] In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1800) [17:32:22] TheresNoTime: I'm messing w/ the deployment server at the moment.
[17:32:38] ack, not thinking of doing any deploys :) [17:33:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P45231 and previous config saved to /var/cache/conftool/dbconfig/20230307-173341-marostegui.json [17:34:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T328817)', diff saved to https://phabricator.wikimedia.org/P45232 and previous config saved to /var/cache/conftool/dbconfig/20230307-173453-marostegui.json [17:35:00] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:35:11] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+1] analytics::refinery::job::eventlogging_to_druid: Default to deploy-mode cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [17:38:03] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T329203)', diff saved to https://phabricator.wikimedia.org/P45233 and previous config saved to /var/cache/conftool/dbconfig/20230307-173901-marostegui.json [17:39:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [17:39:08] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:39:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [17:39:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T329203)', diff saved to https://phabricator.wikimedia.org/P45234 and previous config saved to /var/cache/conftool/dbconfig/20230307-173923-marostegui.json [17:39:28] (03PS1) 10AOkoth: vrts: copy data to passive host [puppet] - 10https://gerrit.wikimedia.org/r/895334 (https://phabricator.wikimedia.org/T323515) [17:39:45] (JobUnavailable) resolved: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:40:07] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:40:07] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum3002.esams.wmnet with OS bullseye [17:40:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:40:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum3002.esams.wmnet with OS bullseye [17:41:36] (03Merged) 10jenkins-bot: sre.loadbalancer.restart-pybal: simplify call [cookbooks] - 10https://gerrit.wikimedia.org/r/895184 (owner: 10Volans) [17:41:38] (03Merged) 10jenkins-bot: sre.network: fix minor bugs and type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895185 (owner: 10Volans) [17:41:59] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:43:41] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:44:05] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage [17:44:45] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:45:01] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:50] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10RLazarus) p:05Triage→03High [17:46:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) Yeah, I saw similar on ms-be2070; the problem being the disk isn't entirely blank. I suspect ` sudo wipefs -a /dev/disk/by-path/pci-0000... [17:46:35] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10RLazarus) a:03lmata [17:47:24] !log volans@cumin1001 START - Cookbook sre.network.cf [17:47:24] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:47:29] (03PS1) 10Cmjohnson: Adding new cloudcephosd hosts to site.pp insetup::nofirm [puppet] - 10https://gerrit.wikimedia.org/r/894582 (https://phabricator.wikimedia.org/T324998) [17:47:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir5002.eqsin.wmnet with reason: host reimage [17:48:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T329260)', diff saved to https://phabricator.wikimedia.org/P45235 and previous config saved to /var/cache/conftool/dbconfig/20230307-174848-marostegui.json [17:48:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:48:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:49:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:50:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45236 and previous config saved to /var/cache/conftool/dbconfig/20230307-175000-marostegui.json [17:50:06] (03PS2) 10Cmjohnson: Adding new cloudcephosd hosts to site.pp insetup::nofirm [puppet] - 10https://gerrit.wikimedia.org/r/894582 (https://phabricator.wikimedia.org/T324998) [17:50:51] (03CR) 10Cmjohnson: [C: 03+2] Adding new cloudcephosd hosts to site.pp insetup::nofirm [puppet] - 10https://gerrit.wikimedia.org/r/894582 (https://phabricator.wikimedia.org/T324998) (owner: 10Cmjohnson) [17:51:50] !log bking@cumin2002 repool wdqs hosts post-maintenance T329073 [17:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:56] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 
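A sketch of the ms-be2070/ms-be2071 cleanup suggested in T326352 above: the puppet-driven mkfs fails because the disk still carries old filesystem or partition signatures, so the signatures are wiped first and puppet is re-run. The device path below is a placeholder, since the real /dev/disk/by-path/... name is truncated in the log.

    DEV=/dev/disk/by-path/<full-by-path-name-from-the-task>   # placeholder, not the real path
    sudo wipefs -a "$DEV"        # clear stale filesystem/partition signatures
    sudo puppet agent -t         # let puppet retry the mkfs/mount it was attempting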
[17:52:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T329203)', diff saved to https://phabricator.wikimedia.org/P45237 and previous config saved to /var/cache/conftool/dbconfig/20230307-175251-marostegui.json [17:52:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:52:58] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:53:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45238 and previous config saved to /var/cache/conftool/dbconfig/20230307-175314-marostegui.json [17:55:39] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) Thanks! I'll discuss this with the team and circle back with: - Confirmation that our assumptions on capacity and metrics - Next steps to remediate whatever aff... [17:56:10] (03PS1) 10JMeybohm: Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) [17:57:07] (03CR) 10CI reject: [V: 04-1] Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:57:21] (03PS2) 10JMeybohm: Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) [17:57:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3002.esams.wmnet with reason: host reimage [17:58:35] (03CR) 10CI reject: [V: 04-1] Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [17:59:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45239 and previous config saved to /var/cache/conftool/dbconfig/20230307-175907-marostegui.json [17:59:14] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1800) [18:01:35] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again `00961: FAILED: internal_api_error_UploadChunkFileException: [91359c72-48cc-4405-98c8-bb5f4...
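The downtime → depool → maintenance → staged-repool pattern that repeats throughout this log, e.g. for db1146:3312 just above, is driven from the cumin hosts with the sre.hosts.downtime cookbook plus dbctl. A minimal sketch of the dbctl side, assuming the standard dbctl subcommands; the weight step is illustrative, the host and task ID are taken from the log.

    # take the instance out of rotation and commit the change
    dbctl instance db1146:3312 depool
    dbctl config commit -m "Depooling db1146:3312 (T329260)"
    # after the schema change, step the instance back in gradually; the repeated
    # "Repooling after maintenance" commits above correspond to these incremental steps
    dbctl instance db1146:3312 pool -p 25   # percentage is an illustrative assumption
    dbctl config commit -m "Repooling after maintenance db1146:3312 (T329260)"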
[18:01:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3002.esams.wmnet with reason: host reimage [18:02:14] (03PS3) 10JMeybohm: Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) [18:05:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45240 and previous config saved to /var/cache/conftool/dbconfig/20230307-180506-marostegui.json [18:05:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:05:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) Thanks that fixed the issue. [18:07:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45241 and previous config saved to /var/cache/conftool/dbconfig/20230307-180757-marostegui.json [18:10:34] (03CR) 10Ebernhardson: search-airflow: add analytics sql replica creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [18:12:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [18:12:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1035.eqiad.wmnet with OS bullseye [18:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P45242 and previous config saved to /var/cache/conftool/dbconfig/20230307-181414-marostegui.json [18:14:22] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:15:44] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:16:15] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir5002.eqsin.wmnet with OS bullseye [18:16:26] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir5002.eqsin.wmnet with OS bullseye completed: - ncredir5002 (**PASS**) - Downtimed on Ic... 
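The "conftool action" entries above, which pool ncredir5002 out and back in around its reimage, correspond to confctl calls along these lines; this is a sketch using the usual confctl select/set syntax, with the host name taken from the log.

    # depool before the reimage, repool once it passes
    sudo confctl select 'name=ncredir5002.eqsin.wmnet' set/pooled=no
    sudo confctl select 'name=ncredir5002.eqsin.wmnet' set/pooled=yes
    # read the object back to confirm the value that actually got stored
    sudo confctl select 'name=ncredir5002.eqsin.wmnet' get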
[18:16:44] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:17:34] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:17:36] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [18:17:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum3002.esams.wmnet with OS bullseye [18:18:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum3002.esams.wmnet with OS bullseye completed: - durum3002 (**PASS**) - Downtimed on Icinga... [18:18:56] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum4001.ulsfo.wmnet with OS bullseye [18:19:13] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum5001.eqsin.wmnet with OS bullseye [18:19:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum5001.eqsin.wmnet with OS bullseye [18:20:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T328817)', diff saved to https://phabricator.wikimedia.org/P45243 and previous config saved to /var/cache/conftool/dbconfig/20230307-182013-marostegui.json [18:20:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:20:21] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:20:25] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum6001.drmrs.wmnet with OS bullseye [18:20:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:20:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T328817)', diff saved to https://phabricator.wikimedia.org/P45244 and previous config saved to /var/cache/conftool/dbconfig/20230307-182035-marostegui.json [18:20:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum6001.drmrs.wmnet with OS bullseye [18:22:47] (03CR) 10Nicolas Fraison: "Not nice at all we now have an OOM 😊" [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [18:22:51] (03CR) 10Nicolas Fraison: [C: 04-2] hive: Fix max metaspace size of hiveserver2 prod to 512m [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [18:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45245 and previous config saved to /var/cache/conftool/dbconfig/20230307-182304-marostegui.json [18:23:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:24:08] PROBLEM - BFD status on cr3-ulsfo 
is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:10] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:12] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:26] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:24:40] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:25:22] (03Abandoned) 10Nicolas Fraison: hive: Fix max metaspace size of hiveserver2 prod to 512m [puppet] - 10https://gerrit.wikimedia.org/r/894483 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [18:25:44] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 476, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:26:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2071.codfw.wmnet with OS bullseye [18:26:12] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2071.codfw.wmnet with OS bullseye completed: - ms-be... 
[18:26:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2072.codfw.wmnet with OS bullseye [18:27:07] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye [18:29:09] !log dancy@deploy2002: Fixing up /srv/mediawiki-staging/.git permissions [18:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P45246 and previous config saved to /var/cache/conftool/dbconfig/20230307-182921-marostegui.json [18:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T328817)', diff saved to https://phabricator.wikimedia.org/P45247 and previous config saved to /var/cache/conftool/dbconfig/20230307-183136-marostegui.json [18:31:44] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:32:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:33:12] (03CR) 10JMeybohm: [C: 04-1] "Sorry 😇" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [18:35:07] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir5002.eqsin.wmnet [18:35:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:35:21] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:37:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [18:38:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T329203)', diff saved to https://phabricator.wikimedia.org/P45248 and previous config saved to /var/cache/conftool/dbconfig/20230307-183810-marostegui.json [18:38:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:38:18] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [18:38:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:39:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [18:39:16] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10Aklapper) 05Open→03Stalled @lwatson: Thanks. Could you please provide the purpose why you need to be added to `ldap/wmf`, and provide a contact person within WMF? 
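The START/END pairs for sre.hosts.downtime above come from the cookbook runner on the cumin hosts. A hedged sketch of an equivalent invocation; the exact flag names can differ between cookbook versions, and the host and reason are lifted from the db2141 entry above only for illustration.

    # silence alerting for 12 hours while db2141 is under maintenance (illustrative)
    sudo cookbook sre.hosts.downtime --hours 12 -r "Maintenance" db2141.codfw.wmnet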
[18:39:17] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir6001.eqsin.wmnet [18:39:22] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir6001.eqsin.wmnet [18:39:36] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir6001.drmrs.wmnet with OS bullseye [18:39:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bullseye [18:40:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [18:43:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [18:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45249 and previous config saved to /var/cache/conftool/dbconfig/20230307-184428-marostegui.json [18:44:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [18:44:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:44:39] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:44:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [18:44:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:44:53] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:45:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:45:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45250 and previous config saved to /var/cache/conftool/dbconfig/20230307-184506-marostegui.json [18:45:13] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:45:39] (03CR) 10Dzahn: vrts: copy data to passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895334 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [18:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45251 and previous config saved to /var/cache/conftool/dbconfig/20230307-184642-marostegui.json [18:46:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2072.codfw.wmnet with reason: host reimage [18:46:59] !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.40.0-wmf.26 refs T330204 [18:47:06] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [18:47:14] (JobUnavailable) firing: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:47:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:48:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bullseye [18:48:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum4001.ulsfo.wmnet with OS bullseye completed: - durum4001 (**PASS**) - Downtimed on Icinga... [18:48:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [18:48:59] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:49:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [18:49:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T329203)', diff saved to https://phabricator.wikimedia.org/P45252 and previous config saved to /var/cache/conftool/dbconfig/20230307-184907-marostegui.json [18:49:14] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [18:49:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2072.codfw.wmnet with reason: host reimage [18:50:28] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45253 and previous config saved to /var/cache/conftool/dbconfig/20230307-185058-marostegui.json [18:51:05] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:51:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:51:54] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:14] (JobUnavailable) resolved: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:52:26] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:54:30] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:55:40] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:56:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:56:16] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host durum6001.drmrs.wmnet with OS 
bullseye [18:56:26] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum6001.drmrs.wmnet with OS bullseye executed with errors: - durum6001 (**FAIL**) - Downtime... [18:57:33] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again with the same file `01189: FAILED: internal_api_error_UploadChunkFileException: [8351713a-b... [18:57:49] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage [18:59:38] !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.40.0-wmf.26 refs T330204 (duration: 12m 38s) [18:59:44] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [18:59:55] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10cmooney) p:05Triage→03Low [19:00:05] jeena and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T1900). [19:00:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:22] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10cmooney) [19:00:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [19:01:02] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:01:10] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:01:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS bullseye [19:01:46] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum5001.eqsin.wmnet with OS bullseye completed: - durum5001 (**PASS**) - Downtimed on Icinga... 
[19:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45254 and previous config saved to /var/cache/conftool/dbconfig/20230307-190149-marostegui.json [19:01:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6001.drmrs.wmnet with reason: host reimage [19:03:09] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10cmooney) [19:03:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:03:34] (03CR) 10Dzahn: [C: 03+1] gitlab: production host needs additional flag for restore [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) (owner: 10Jelto) [19:03:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:03:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T329203)', diff saved to https://phabricator.wikimedia.org/P45255 and previous config saved to /var/cache/conftool/dbconfig/20230307-190353-marostegui.json [19:04:00] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [19:04:09] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum6001.drmrs.wmnet with OS bullseye [19:04:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum6001.drmrs.wmnet with OS bullseye [19:05:42] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:06:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P45256 and previous config saved to /var/cache/conftool/dbconfig/20230307-190604-marostegui.json [19:06:18] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum4002.ulsfo.wmnet with OS bullseye [19:06:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum4002.ulsfo.wmnet with OS bullseye [19:06:31] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum5002.eqsin.wmnet with OS bullseye [19:06:41] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum5002.eqsin.wmnet with OS bullseye [19:07:48] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:08:07] ^ expected [19:08:25] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1035.eqiad.wmnet with OS bullseye 
[19:08:30] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:08:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1035.eqiad.wmnet with OS bullseye execu... [19:10:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:10:48] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:10:58] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:23] all are expected [19:11:29] such is life [19:12:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:13:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:13:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2072.codfw.wmnet with OS bullseye [19:13:44] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:13:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2072.codfw.wmnet with OS bullseye completed: - ms-be... 
[19:16:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T328817)', diff saved to https://phabricator.wikimedia.org/P45257 and previous config saved to /var/cache/conftool/dbconfig/20230307-191656-marostegui.json [19:16:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [19:17:03] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:17:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [19:17:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [19:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45258 and previous config saved to /var/cache/conftool/dbconfig/20230307-191717-marostegui.json [19:17:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [19:17:54] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45259 and previous config saved to /var/cache/conftool/dbconfig/20230307-191900-marostegui.json [19:19:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [19:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P45260 and previous config saved to /var/cache/conftool/dbconfig/20230307-192111-marostegui.json [19:21:22] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ncredir6001.drmrs.wmnet with OS bullseye [19:21:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir6001.drmrs.wmnet with OS bullseye executed with errors: - ncredir6001 (**FAIL**) - Down... 
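durum6001 and ncredir6001 above both end their sre.ganeti.reimage runs with exit_code=99; the pattern visible in this log is simply to re-run the cookbook against the same host, as happens for durum6001 at 19:04. A hedged sketch of such a retry; the flag spelling and argument order are assumptions.

    # retry the failed Ganeti VM reimage with the same target OS (illustrative invocation)
    sudo cookbook sre.ganeti.reimage --os bullseye durum6001.drmrs.wmnet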
[19:21:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:22:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [19:28:11] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:17] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:28:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45261 and previous config saved to /var/cache/conftool/dbconfig/20230307-192833-marostegui.json [19:28:40] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:28:41] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:29:01] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:29:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [19:29:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) [19:30:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10Peachey88) [19:31:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [19:31:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1040.eqiad.wmnet with OS bullseye [19:32:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [19:33:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:34:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45262 and previous config saved to /var/cache/conftool/dbconfig/20230307-193406-marostegui.json [19:34:41] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:35:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS bullseye [19:35:56] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host durum6001.drmrs.wmnet with OS bullseye [19:36:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host 
durum4002.ulsfo.wmnet with OS bullseye completed: - durum4002 (**PASS**) - Downtimed on Icinga... [19:36:11] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:36:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:36:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum6001.drmrs.wmnet with OS bullseye executed with errors: - durum6001 (**FAIL**) - Downtime... [19:36:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T329260)', diff saved to https://phabricator.wikimedia.org/P45263 and previous config saved to /var/cache/conftool/dbconfig/20230307-193617-marostegui.json [19:36:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:36:25] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:36:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:36:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45264 and previous config saved to /var/cache/conftool/dbconfig/20230307-193639-marostegui.json [19:37:34] !log brett@cumin2002 conftool action : set/pooled=yess; selector: name=ncredir6001.eqsin.wmnet [19:38:40] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [19:38:46] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895347 (https://phabricator.wikimedia.org/T330204) [19:38:50] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895347 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:39:18] brett: just in case you missed it, note set/pooled=yess vs. yes, I'm not sure what happens in that state! [19:39:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895347 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:39:43] rzl: I was just super excited to get it up :) [19:39:46] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir6001.eqsin.wmnet [19:39:50] !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.40.0-wmf.26 refs T330204 [19:39:53] !log jhuneidi@deploy2002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.40.0-wmf.26" --no-progress --store-class=LCStoreCDB --threads=30 --lang en --quiet ' returned non-zero exit status 255. 
(duration: 00m 02s) [19:39:55] rzl: What an interesting eagle eye you have [19:39:56] hell yeah [19:39:56] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [19:40:10] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir6002.eqsin.wmnet [19:40:20] ahahaha I'm just sitting here STARING, waiting for my moment to pounce [19:40:30] * brett sweats nervously while he types [19:40:35] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance [19:40:40] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir6002.drmrs.wmnet with OS bullseye [19:40:42] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance (duration: 00m 07s) [19:40:49] (happened to be tabbing through, sorry to backseat-drive, I just wanted to save you some time if it turned out to not be working) [19:40:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bullseye [19:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45265 and previous config saved to /var/cache/conftool/dbconfig/20230307-194132-marostegui.json [19:41:39] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:43:00] (03PS6) 10Bking: search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) [19:43:22] (03CR) 10CI reject: [V: 04-1] search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [19:43:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45266 and previous config saved to /var/cache/conftool/dbconfig/20230307-194340-marostegui.json [19:45:33] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T329203)', diff saved to https://phabricator.wikimedia.org/P45267 and previous config saved to /var/cache/conftool/dbconfig/20230307-194913-marostegui.json [19:49:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [19:49:20] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [19:49:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [19:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T329203)', diff saved to https://phabricator.wikimedia.org/P45268 and previous config saved to /var/cache/conftool/dbconfig/20230307-194934-marostegui.json [19:49:39] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:51:26] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.ganeti.reimage (exit_code=0) for host durum5002.eqsin.wmnet with OS bullseye [19:51:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum5002.eqsin.wmnet with OS bullseye completed: - durum5002 (**PASS**) - Downtimed on Icinga... [19:52:55] (03PS7) 10Bking: search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) [19:56:03] (03PS8) 10Bking: search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) [19:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P45270 and previous config saved to /var/cache/conftool/dbconfig/20230307-195639-marostegui.json [19:57:14] (03PS1) 10Jforrester: Manually add extensions/Renameuser to wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895350 [19:57:21] (03PS9) 10Bking: search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) [19:57:38] (03PS10) 10Bking: search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) [19:57:50] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage [19:57:57] (03PS2) 10Jforrester: Manually add extensions/Renameuser to wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895350 [19:58:17] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [19:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45272 and previous config saved to /var/cache/conftool/dbconfig/20230307-195846-marostegui.json [20:01:54] (03CR) 10Ebernhardson: [C: 03+1] search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:01:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir6002.drmrs.wmnet with reason: host reimage [20:01:56] (03CR) 10Bking: search-airflow: add analytics sql replica creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:02:02] (03CR) 10Bking: [C: 03+2] search-airflow: add analytics sql replica creds [puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:03:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T329203)', diff saved to https://phabricator.wikimedia.org/P45273 and previous config saved to /var/cache/conftool/dbconfig/20230307-200344-marostegui.json [20:03:51] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - 
https://phabricator.wikimedia.org/T329203 [20:04:36] (03PS1) 10Jforrester: Unload RenameUser, now part of core: Part I of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895351 [20:04:38] (03PS1) 10Jforrester: Unload RenameUser, now part of core: Part II of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895352 [20:04:41] (03CR) 10Jeena Huneidi: [C: 03+2] Manually add extensions/Renameuser to wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895350 (owner: 10Jforrester) [20:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P45274 and previous config saved to /var/cache/conftool/dbconfig/20230307-201145-marostegui.json [20:12:12] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance [20:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45276 and previous config saved to /var/cache/conftool/dbconfig/20230307-201353-marostegui.json [20:13:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [20:14:00] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:14:02] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance (duration: 01m 49s) [20:14:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [20:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45277 and previous config saved to /var/cache/conftool/dbconfig/20230307-201414-marostegui.json [20:14:21] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [20:16:38] !log bking@deploy2002 Started deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance-bk [20:17:56] !log bking@deploy2002 Finished deploy [airflow-dags/search@9924c93]: initial deployment to search platform airflow 2 instance-bk (duration: 01m 18s) [20:18:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45279 and previous config saved to /var/cache/conftool/dbconfig/20230307-201851-marostegui.json [20:19:37] (03CR) 10Jeena Huneidi: [C: 03+2] Manually add extensions/Renameuser to wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895350 (owner: 10Jforrester) [20:19:58] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ncredir6002.drmrs.wmnet with OS bullseye [20:20:09] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir6002.drmrs.wmnet with OS bullseye executed with errors: - ncredir6002 (**FAIL**) - Down... 
[20:21:54] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir6002.eqsin.wmnet [20:22:43] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:23:04] (03Merged) 10jenkins-bot: Manually add extensions/Renameuser to wmf.26 [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895350 (owner: 10Jforrester) [20:24:03] !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.40.0-wmf.26 refs T330204 [20:24:09] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [20:26:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45280 and previous config saved to /var/cache/conftool/dbconfig/20230307-202640-marostegui.json [20:26:47] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [20:26:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45281 and previous config saved to /var/cache/conftool/dbconfig/20230307-202652-marostegui.json [20:26:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [20:27:00] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:27:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [20:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45282 and previous config saved to /var/cache/conftool/dbconfig/20230307-202713-marostegui.json [20:27:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [20:27:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1040.eqiad.wmnet with OS bullseye execu... 
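The dbctl entries above (for example db1170:3312 being repooled at 19:56, 20:11 and 20:26) show the usual maintenance rhythm: depool a replica, downtime it, run the schema change, then bring it back in several commits rather than all at once. The sketch below only generates that command plan as strings; the dbctl subcommand syntax and the step percentages are assumptions made for illustration, inferred from the commit messages logged here rather than taken from dbctl documentation.

    def staged_repool_plan(instance, task, steps=(10, 25, 50, 75, 100)):
        """Depool -> maintain -> staged repool plan for one replica, as command strings."""
        plan = [
            f"dbctl instance {instance} depool",
            f"dbctl config commit -m 'Depooling {instance} ({task})'",
            "# ... run the schema change / maintenance here ...",
        ]
        for pct in steps:
            # Assumed dbctl syntax for raising the pooled percentage step by step.
            plan.append(f"dbctl instance {instance} pool -p {pct}")
            plan.append(f"dbctl config commit -m 'Repooling after maintenance {instance} ({task})'")
        return plan

    for cmd in staged_repool_plan("db1170:3312", "T329260"):
        print(cmd)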
[20:29:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2073.codfw.wmnet with OS bullseye [20:29:58] !log ebernhardson@deploy2002 Started deploy [wikimedia/discovery/analytics@c8dc6d5]: test deploy old airflow instance [20:30:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye [20:30:04] !log ebernhardson@deploy2002 Finished deploy [wikimedia/discovery/analytics@c8dc6d5]: test deploy old airflow instance (duration: 00m 05s) [20:30:25] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9924c93]: test deploy new airflow instance [20:30:27] !log ebernhardson@deploy2002 deploy aborted: test deploy new airflow instance (duration: 00m 02s) [20:32:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45283 and previous config saved to /var/cache/conftool/dbconfig/20230307-203203-marostegui.json [20:32:10] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:33:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45284 and previous config saved to /var/cache/conftool/dbconfig/20230307-203357-marostegui.json [20:35:14] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir3001.drmrs.wmnet [20:35:34] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir3001.esams.wmnet with OS bullseye [20:35:37] (03PS3) 10Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) [20:35:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir3001.esams.wmnet with OS bullseye [20:35:51] (03CR) 10Eevans: [C: 03+2] data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) (owner: 10Eevans) [20:35:54] (03CR) 10CI reject: [V: 04-1] data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) (owner: 10Eevans) [20:41:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45286 and previous config saved to /var/cache/conftool/dbconfig/20230307-204146-marostegui.json [20:41:57] (03PS4) 10Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) [20:42:11] (03CR) 10jenkins-bot: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) (owner: 10Eevans) [20:42:45] (JobUnavailable) firing: Reduced availability for job ncredir in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:46:40] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10lwatson) Hi, my apologies @Aklapper - I'll get back to you on this. Let me check with my mentor [20:47:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P45287 and previous config saved to /var/cache/conftool/dbconfig/20230307-204710-marostegui.json [20:49:04] (03PS1) 10Bking: deploy: permit airflow-search-admins group to deploy [puppet] - 10https://gerrit.wikimedia.org/r/895354 (https://phabricator.wikimedia.org/T327970) [20:49:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T329203)', diff saved to https://phabricator.wikimedia.org/P45288 and previous config saved to /var/cache/conftool/dbconfig/20230307-204904-marostegui.json [20:49:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [20:49:11] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [20:49:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [20:49:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T329203)', diff saved to https://phabricator.wikimedia.org/P45289 and previous config saved to /var/cache/conftool/dbconfig/20230307-204925-marostegui.json [20:49:48] (03CR) 10Ryan Kemper: [C: 03+1] deploy: permit airflow-search-admins group to deploy [puppet] - 10https://gerrit.wikimedia.org/r/895354 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:49:58] (03CR) 10Bking: [C: 03+2] deploy: permit airflow-search-admins group to deploy [puppet] - 10https://gerrit.wikimedia.org/r/895354 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:50:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2073.codfw.wmnet with reason: host reimage [20:50:27] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:50:35] PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - free space: /srv 10603 MB (3% inode=71%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [20:51:04] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10lwatson) [20:51:26] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10lwatson) 05Stalled→03Open [20:52:45] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3001.esams.wmnet with reason: host reimage [20:53:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2073.codfw.wmnet with reason: host reimage [20:54:16] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) [20:55:19] 10SRE, 10LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (10AnneT) @NHillard-WMF could you confirm what we should be 
providing for "purpose"? I think @lwatson will just need appropriate rights in Gerrit, but I don't know the correct terminology here. [20:56:21] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9924c93]: test deploy new airflow instance [20:56:23] !log ebernhardson@deploy2002 deploy aborted: test deploy new airflow instance (duration: 00m 01s) [20:56:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3001.esams.wmnet with reason: host reimage [20:56:43] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9924c93]: test deploy new airflow instance [20:56:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45290 and previous config saved to /var/cache/conftool/dbconfig/20230307-205653-marostegui.json [20:58:46] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@9924c93]: test deploy new airflow instance (duration: 02m 03s) [20:59:35] (03PS2) 10Samtar: Enable new Linter UI for namespace, tag and template for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [20:59:49] I am here :-) [21:00:02] please hold the backports [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230307T2100). [21:00:05] sbailey: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] I can deploy [21:00:13] jeena: ack [21:00:33] we had a couple errors and are running late with the train deploy today [21:00:43] * urbanecm waves [21:00:53] currently syncing to testwikis, then I can deploy group0 [21:00:54] Standing by, take your time [21:00:58] thanks! 
[21:01:03] will wait to hear from you jeena [21:01:26] busy coding in the mean time queryBuilder is my friend [21:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P45291 and previous config saved to /var/cache/conftool/dbconfig/20230307-210216-marostegui.json [21:02:22] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir4001.drmrs.wmnet [21:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T329203)', diff saved to https://phabricator.wikimedia.org/P45292 and previous config saved to /var/cache/conftool/dbconfig/20230307-210243-marostegui.json [21:02:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:02:51] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir4001.ulsfo.wmnet [21:03:13] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir4001.ulsfo.wmnet with OS bullseye [21:03:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bullseye [21:05:41] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10Volans) p:05Triage→03High a:03SLyngshede-WMF Interesting, thanks for the report @BCornwall @SLyngshede-WMF could you have a look please? From a quick... [21:06:04] !log lvs500[45]: disabling puppet and stopping pybal, all eqsin traffic through lvs5006 temporarily... 
[21:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:00] !log bking@deploy2002 Started deploy [airflow-dags/search@d533716]: initial deployment to search platform airflow 2 instance-bk [21:07:41] !log bking@deploy2002 Finished deploy [airflow-dags/search@d533716]: initial deployment to search platform airflow 2 instance-bk (duration: 00m 41s) [21:07:56] !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.40.0-wmf.26 refs T330204 (duration: 43m 53s) [21:08:01] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [21:09:04] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:09:12] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:09:22] ^ ignore the pybal-related alerts on lvs500[45] for now, sorry [21:09:34] (03PS1) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [21:09:44] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:10:05] !log jhuneidi@deploy2002 Pruned MediaWiki: 1.40.0-wmf.24 (duration: 02m 08s) [21:10:48] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:10:58] !log lvs500[45]: re-enabling/pooling, back to normal flow [21:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:10] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:11:31] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [21:11:48] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:12:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T328817)', diff saved to https://phabricator.wikimedia.org/P45293 and previous config saved to /var/cache/conftool/dbconfig/20230307-211159-marostegui.json [21:12:02] RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:12:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:12:07] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:12:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:12:45] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:12:58] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895357 
(https://phabricator.wikimedia.org/T330204) [21:13:00] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895357 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [21:13:41] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895357 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [21:14:00] (03PS1) 10Volans: homer: increase default timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/895358 [21:14:31] (03CR) 10EoghanGaffney: [C: 03+1] "This is great!" [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) (owner: 10Jelto) [21:15:48] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir3001.esams.wmnet with OS bullseye [21:15:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir3001.esams.wmnet with OS bullseye completed: - ncredir3001 (**PASS**) - Downtimed on Ic... [21:16:22] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:16:42] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir3001.esams.wmnet [21:17:17] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir3002.esams.wmnet [21:17:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:17:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage [21:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45294 and previous config saved to /var/cache/conftool/dbconfig/20230307-211723-marostegui.json [21:17:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [21:17:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [21:17:32] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir3002.esams.wmnet with OS bullseye [21:17:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [21:17:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T329260)', diff saved to https://phabricator.wikimedia.org/P45295 and previous config saved to /var/cache/conftool/dbconfig/20230307-211744-marostegui.json [21:17:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir3002.esams.wmnet with OS bullseye [21:17:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45296 and previous config saved to /var/cache/conftool/dbconfig/20230307-211749-marostegui.json [21:18:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T329260)', diff saved to https://phabricator.wikimedia.org/P45297 and previous 
config saved to /var/cache/conftool/dbconfig/20230307-211857-marostegui.json [21:19:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4001.ulsfo.wmnet with reason: host reimage [21:20:54] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.26 refs T330204 [21:21:00] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [21:21:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:21:32] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:21:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T328817)', diff saved to https://phabricator.wikimedia.org/P45298 and previous config saved to /var/cache/conftool/dbconfig/20230307-212138-marostegui.json [21:21:41] TheresNoTime: sbailey ready for you now [21:21:45] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:21:50] ok, ready [21:21:54] jeena: okay :) [21:22:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:23:18] (03Merged) 10jenkins-bot: Enable new Linter UI for namespace, tag and template for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894733 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:23:56] !log samtar@deploy2002 Started scap: Backport for [[gerrit:894733|Enable new Linter UI for namespace, tag and template for group1 wikis (T299612)]] [21:24:02] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:24:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [21:25:33] !log samtar@deploy2002 sbailey and samtar: Backport for [[gerrit:894733|Enable new Linter UI for namespace, tag and template for group1 wikis (T299612)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:25:47] sbailey: live on mwdebug (any) — are you able to test? 
[21:25:55] yes testing now [21:27:13] Yup live on meta and test2, working on both as expected :-) [21:27:22] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host durum6002.drmrs.wmnet with OS bullseye [21:27:29] ack, syncing [21:27:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host durum6002.drmrs.wmnet with OS bullseye [21:27:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:27:45] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:28:49] great,thanks Sammy and Jeena [21:30:54] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:32:44] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3002.esams.wmnet with reason: host reimage [21:32:45] new LintErrors TheresNoTime ? [21:32:45] (JobUnavailable) resolved: (2) Reduced availability for job ncredir in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45299 and previous config saved to /var/cache/conftool/dbconfig/20230307-213256-marostegui.json [21:32:58] herzog: ack, looking [21:33:08] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:894733|Enable new Linter UI for namespace, tag and template for group1 wikis (T299612)]] (duration: 09m 11s) [21:33:14] herzog: where are you seeing that? [21:33:15] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:33:18] sbailey: ^ [21:33:20] New report UI for lint errors, no change in recording or parsoid generation [21:33:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T328817)', diff saved to https://phabricator.wikimedia.org/P45300 and previous config saved to /var/cache/conftool/dbconfig/20230307-213334-marostegui.json [21:33:37] TheresNoTime: what s-bailey said :) [21:33:40] Adds search on tag and template info [21:33:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [21:33:50] herzog: oh I thought you meant you were *seeing* errors, sorry [21:33:55] * TheresNoTime did a panic [21:34:00] search interface there [21:34:04] looks great [21:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P45301 and previous config saved to /var/cache/conftool/dbconfig/20230307-213403-marostegui.json [21:34:22] 894733 is now live fwiw :) [21:35:09] Sweet, seeing it on production meta and test2 great !!!! 
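For readers following the backport above: the change was approved through scap backport (see the TrainBranchBot comment at 21:22:25), merged by CI, synced first to the mwdebug test servers for sbailey to verify, and only then synced fleet-wide at 21:33:08. Driving that from the deployment host is essentially one command; everything beyond the bare invocation below is handled interactively by scap itself, and the wrapper is just an illustrative sketch.

    import subprocess

    def backport(change_number: int) -> None:
        """Run scap backport for one Gerrit change from the deployment host.

        scap handles the merge, the sync to the mwdebug test servers and the
        confirmation step before the fleet-wide sync; only the bare
        'scap backport <number>' call here is taken from this log.
        """
        subprocess.run(["scap", "backport", str(change_number)], check=True)

    backport(894733)  # the Linter UI change deployed in this window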
[21:35:12] thanks [21:35:27] np :) [21:35:53] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3002.esams.wmnet with reason: host reimage [21:37:27] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir4001.ulsfo.wmnet with OS bullseye [21:37:38] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir4001.ulsfo.wmnet with OS bullseye completed: - ncredir4001 (**PASS**) - Downtimed on Ic... [21:37:47] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir4001.ulsfo.wmnet [21:38:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:38:35] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:38:38] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir4002.ulsfo.wmnet [21:39:02] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir4002.ulsfo.wmnet with OS bullseye [21:39:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bullseye [21:40:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:40:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2073.codfw.wmnet with OS bullseye [21:41:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2073.codfw.wmnet with OS bullseye completed: - ms-be... 
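The "synced to the testservers" step a few minutes earlier is what made the change visible on mwdebug before the full sync; testers pin their requests to one debug backend with the X-Wikimedia-Debug header. The sketch below shows that kind of spot check; the backend=<host> header value is an assumption about the WikimediaDebug convention, not something taken from this log, so adjust it to whatever the local tooling expects.

    import urllib.request

    def fetch_via_mwdebug(url: str, backend: str = "mwdebug1001.eqiad.wmnet") -> int:
        """Fetch a page pinned to one mwdebug backend and return the HTTP status."""
        req = urllib.request.Request(url, headers={
            # Assumed header value format; routes the request to the named debug server.
            "X-Wikimedia-Debug": f"backend={backend}",
            "User-Agent": "mwdebug-spot-check/0.1 (example sketch)",
        })
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status

    print(fetch_via_mwdebug("https://meta.wikimedia.org/wiki/Special:LintErrors"))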
[21:41:22] !log bking@cumin2002 ban elastic row D hosts to prepare for T322082 [21:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:28] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [21:41:37] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:42:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [21:43:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) [21:43:08] !log close UTC late backport window [21:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [21:46:14] (JobUnavailable) firing: Reduced availability for job ncredir in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:40] (03PS1) 10JHathaway: kernel-purge: include previous kernel version [puppet] - 10https://gerrit.wikimedia.org/r/895361 (https://phabricator.wikimedia.org/T277011) [21:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T329203)', diff saved to https://phabricator.wikimedia.org/P45302 and previous config saved to /var/cache/conftool/dbconfig/20230307-214802-marostegui.json [21:48:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:48:11] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [21:48:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45303 and previous config saved to /var/cache/conftool/dbconfig/20230307-214824-marostegui.json [21:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45304 and previous config saved to /var/cache/conftool/dbconfig/20230307-214841-marostegui.json [21:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P45305 and previous config saved to /var/cache/conftool/dbconfig/20230307-214910-marostegui.json [21:52:09] (03CR) 10JHathaway: [C: 03+2] kernel-purge: include previous kernel version [puppet] - 10https://gerrit.wikimedia.org/r/895361 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [21:52:10] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage [21:54:35] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir3002.esams.wmnet with OS bullseye [21:54:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started 
by brett@cumin2002 for host ncredir3002.esams.wmnet with OS bullseye completed: - ncredir3002 (**PASS**) - Downtimed on Ic... [21:54:59] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir3002.esams.wmnet [21:55:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir4002.ulsfo.wmnet with reason: host reimage [21:55:27] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:56:06] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir2001.codfw.wmnet [21:56:23] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir2001.codfw.wmnet with OS bullseye [21:56:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bullseye [21:56:55] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 7 hosts with reason: re-rack [21:57:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 7 hosts with reason: re-rack [21:57:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2faab0f0-8bed-4101-9f19-d26f3c99b3d7) set by bking@cumin2002 for 1 day, 0:00:00 on 7 host(s) and their... [21:58:40] !log bking@cumin2002 depool elastic row D hosts to prepare for T322082 [21:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:46] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [21:59:15] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host durum6002.drmrs.wmnet with OS bullseye [21:59:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host durum6002.drmrs.wmnet with OS bullseye executed with errors: - durum6002 (**FAIL**) - Downtime... [21:59:27] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:01:40] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10ssingh) First of all, thanks so much for the Ganeti cookbook -- it's a lifesaver. I can't imagine reimaging these hosts without the cookbook and all the man... 
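The "ban elastic row D hosts" and "depool elastic row D hosts" entries are the usual preparation for physical work like the T322082 re-rack: the nodes are taken out of traffic and excluded from shard allocation so the cluster drains them before they go down. The log does not show how the ban is applied; the sketch below expresses it with plain Elasticsearch allocation filtering as one possible mechanism, and the cluster URL and node-name pattern are invented for illustration.

    import json
    import urllib.request

    def ban_nodes(cluster_url: str, node_names: str) -> None:
        """Exclude nodes from shard allocation so Elasticsearch moves shards off them."""
        body = json.dumps({
            "transient": {"cluster.routing.allocation.exclude._name": node_names}
        }).encode()
        req = urllib.request.Request(
            f"{cluster_url}/_cluster/settings",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(resp.status, resp.read().decode())

    # Hypothetical endpoint and pattern covering the elastic10[53-67] row D hosts:
    ban_nodes("http://localhost:9200", "elastic105*,elastic106*")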
[22:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45306 and previous config saved to /var/cache/conftool/dbconfig/20230307-220222-marostegui.json [22:02:29] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [22:03:03] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@9fba86b]: (no justification provided) [22:03:21] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@9fba86b]: (no justification provided) (duration: 00m 18s) [22:03:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45307 and previous config saved to /var/cache/conftool/dbconfig/20230307-220348-marostegui.json [22:04:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T329260)', diff saved to https://phabricator.wikimedia.org/P45308 and previous config saved to /var/cache/conftool/dbconfig/20230307-220416-marostegui.json [22:04:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [22:04:23] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:04:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [22:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T329260)', diff saved to https://phabricator.wikimedia.org/P45309 and previous config saved to /var/cache/conftool/dbconfig/20230307-220438-marostegui.json [22:05:37] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:05:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T329260)', diff saved to https://phabricator.wikimedia.org/P45310 and previous config saved to /var/cache/conftool/dbconfig/20230307-220550-marostegui.json [22:06:01] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [22:06:14] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:06:41] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage [22:09:10] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2001.codfw.wmnet with reason: host reimage [22:09:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) 05Open→03Resolved @MatthewVernon all yours thank you for getting the partman recipe [22:10:22] (03PS1) 10JHathaway: jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/895363 [22:11:14] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:12:15] (03CR) 10JHathaway: [C: 03+2] jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/895363 (owner: 10JHathaway) [22:13:11] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir4002.ulsfo.wmnet with OS bullseye [22:13:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir4002.ulsfo.wmnet with OS bullseye completed: - ncredir4002 (**PASS**) - Downtimed on Ic... 
[22:13:34] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir4002.ulsfo.wmnet [22:14:28] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir1001.eqiad.wmnet [22:14:39] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir1001.eqiad.wmnet with OS bullseye [22:15:09] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:15:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bullseye [22:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45311 and previous config saved to /var/cache/conftool/dbconfig/20230307-221729-marostegui.json [22:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T328817)', diff saved to https://phabricator.wikimedia.org/P45312 and previous config saved to /var/cache/conftool/dbconfig/20230307-221854-marostegui.json [22:18:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:19:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [22:19:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:19:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [22:19:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [22:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T328817)', diff saved to https://phabricator.wikimedia.org/P45313 and previous config saved to /var/cache/conftool/dbconfig/20230307-221931-marostegui.json [22:20:28] (03PS1) 10Volans: alertmanager: match also FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/895364 [22:20:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P45314 and previous config saved to /var/cache/conftool/dbconfig/20230307-222056-marostegui.json [22:21:14] (JobUnavailable) resolved: (2) Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10Papaul) 05Open→03Resolved a:03Papaul @Jhancock.wm thank you we can resolve this task [22:22:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) [22:23:20] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir2001.codfw.wmnet with OS bullseye [22:23:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - 
https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir2001.codfw.wmnet with OS bullseye completed: - ncredir2001 (**PASS**) - Downtimed on Ic... [22:25:33] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir2001.codfw.wmnet [22:26:11] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:26:24] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet [22:26:35] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir2002.codfw.wmnet with OS bullseye [22:26:46] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bullseye [22:26:56] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [22:31:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [22:32:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45315 and previous config saved to /var/cache/conftool/dbconfig/20230307-223235-marostegui.json [22:33:45] (JobUnavailable) firing: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P45316 and previous config saved to /var/cache/conftool/dbconfig/20230307-223603-marostegui.json [22:36:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage [22:39:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir2002.codfw.wmnet with reason: host reimage [22:44:22] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir1001.eqiad.wmnet with OS bullseye [22:44:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir1001.eqiad.wmnet with OS bullseye completed: - ncredir1001 (**PASS**) - Downtimed on Ic... 
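Each ncredir host in this window goes through the same loop: depool it with conftool, reimage it to bullseye with the sre.ganeti.reimage cookbook (which downtimes it while it reinstalls), then pool it again once the cookbook reports PASS. The toy sketch below just prints that per-host sequence; the selector and set/pooled syntax is copied from the conftool entries logged here, while the exact confctl and cookbook invocations are assumptions.

    HOSTS = [
        "ncredir1001.eqiad.wmnet",
        "ncredir1002.eqiad.wmnet",
        "ncredir2001.codfw.wmnet",
        "ncredir2002.codfw.wmnet",
    ]

    def reimage_steps(host):
        """Depool -> reimage -> repool sequence for one host, as command strings."""
        return [
            f"confctl select 'name={host}' set/pooled=no",
            f"cookbook sre.ganeti.reimage --os bullseye {host}",
            f"confctl select 'name={host}' set/pooled=yes",
        ]

    for host in HOSTS:
        print(f"# {host}")
        for step in reimage_steps(host):
            print(step)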
[22:47:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45317 and previous config saved to /var/cache/conftool/dbconfig/20230307-224742-marostegui.json [22:47:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:47:49] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [22:47:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:48:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45318 and previous config saved to /var/cache/conftool/dbconfig/20230307-224803-marostegui.json [22:51:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T329260)', diff saved to https://phabricator.wikimedia.org/P45319 and previous config saved to /var/cache/conftool/dbconfig/20230307-225110-marostegui.json [22:51:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:51:17] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:51:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:52:58] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:53:45] (JobUnavailable) resolved: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:22] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:54:34] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir2002.codfw.wmnet with OS bullseye [22:54:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir2002.codfw.wmnet with OS bullseye completed: - ncredir2002 (**PASS**) - Downtimed on Ic... 
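The JobUnavailable alerts that fire and resolve around each ncredir reimage are expected noise: while a host is reinstalling, Prometheus cannot scrape its exporters, so the up series for that job drops and recovers once the host is back. A quick way to see which targets are currently down is the standard Prometheus query API; the base URL below is a placeholder.

    import json
    import urllib.parse
    import urllib.request

    def down_targets(prom_url: str, job: str):
        """Return the instances of a job whose last scrape failed (up == 0)."""
        query = urllib.parse.urlencode({"query": f'up{{job="{job}"}} == 0'})
        with urllib.request.urlopen(f"{prom_url}/api/v1/query?{query}", timeout=10) as resp:
            data = json.load(resp)
        return [r["metric"].get("instance", "?") for r in data["data"]["result"]]

    # Placeholder URL; point it at the relevant Prometheus instance.
    print(down_targets("http://localhost:9090", "ncredir"))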
[22:55:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:55:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:57:13] (03CR) 10JHathaway: kernel-purge: enable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [22:59:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:59:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:59:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T329260)', diff saved to https://phabricator.wikimedia.org/P45321 and previous config saved to /var/cache/conftool/dbconfig/20230307-225951-marostegui.json [22:59:58] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45322 and previous config saved to /var/cache/conftool/dbconfig/20230307-230156-marostegui.json [23:02:03] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [23:03:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T329260)', diff saved to https://phabricator.wikimedia.org/P45323 and previous config saved to /var/cache/conftool/dbconfig/20230307-230317-marostegui.json [23:04:26] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:15:12] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:17:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45324 and previous config saved to /var/cache/conftool/dbconfig/20230307-231702-marostegui.json [23:18:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P45325 and previous config saved to /var/cache/conftool/dbconfig/20230307-231824-marostegui.json [23:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T328817)', diff saved to https://phabricator.wikimedia.org/P45326 and previous config saved to /var/cache/conftool/dbconfig/20230307-231957-marostegui.json [23:20:04] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [23:25:22] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:54] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 
(10Jhancock.wm) [23:29:14] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2094.codfw.wmnet - https://phabricator.wikimedia.org/T330828 (10Jhancock.wm) 05Open→03Resolved Thanks for the direction @Papaul. This is completed. [23:29:17] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Jhancock.wm) [23:29:55] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 (10Jhancock.wm) 05Open→03Resolved This is completed. [23:30:44] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:57] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir1001.eqiad.wmnet [23:31:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [23:31:59] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet [23:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45327 and previous config saved to /var/cache/conftool/dbconfig/20230307-233209-marostegui.json [23:32:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [23:32:31] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=ncredir1002.eqiad.wmnet [23:32:43] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir1002.eqiad.wmnet with OS bullseye [23:32:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bullseye [23:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P45328 and previous config saved to /var/cache/conftool/dbconfig/20230307-233330-marostegui.json [23:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45329 and previous config saved to /var/cache/conftool/dbconfig/20230307-233503-marostegui.json [23:39:45] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:39:52] !log ryankemper@deploy2002 Started deploy [airflow-dags/search@3419b7d]: initial deployment to new search platform airflow 2 instance - ryankemper [23:40:08] !log ryankemper@deploy2002 Finished deploy [airflow-dags/search@3419b7d]: initial deployment to new search platform airflow 2 instance - ryankemper (duration: 00m 15s) [23:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T329203)', diff saved to https://phabricator.wikimedia.org/P45331 and previous config saved to /var/cache/conftool/dbconfig/20230307-234715-marostegui.json [23:47:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with 
reason: Maintenance [23:47:23] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [23:47:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:47:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:47:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:47:38] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:47:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T329203)', diff saved to https://phabricator.wikimedia.org/P45332 and previous config saved to /var/cache/conftool/dbconfig/20230307-234741-marostegui.json [23:48:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T329260)', diff saved to https://phabricator.wikimedia.org/P45333 and previous config saved to /var/cache/conftool/dbconfig/20230307-234837-marostegui.json [23:48:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [23:48:44] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:48:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [23:48:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T329260)', diff saved to https://phabricator.wikimedia.org/P45334 and previous config saved to /var/cache/conftool/dbconfig/20230307-234858-marostegui.json [23:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45335 and previous config saved to /var/cache/conftool/dbconfig/20230307-235010-marostegui.json [23:54:45] (JobUnavailable) resolved: Reduced availability for job ncredir in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T329260)', diff saved to https://phabricator.wikimedia.org/P45336 and previous config saved to /var/cache/conftool/dbconfig/20230307-235529-marostegui.json [23:55:37] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
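The MegaRAID check on an-worker1078 has been flapping between WriteThrough and WriteBack all day (21:21, 21:41, 23:04, 23:15, 23:47 above). That pattern usually means the controller is temporarily dropping its write cache on its own, for instance during a BBU learn cycle with a "no write cache if bad BBU" policy, rather than someone reconfiguring the policy; the linked MegaCli page is the place to confirm and fix it. Purely as an illustration of what the check is reporting, the snippet below parses the Icinga output quoted in this log and counts logical drives per cache mode; it does not touch the controller.

    import re
    from collections import Counter

    CHECK_OUTPUT = (
        "CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: "
        "WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, "
        "WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough"
    )

    def cache_policy_counts(output):
        """Count logical drives per write-cache policy from the check output."""
        policies = re.findall(r"Write(?:Back|Through)", output.split("currently using:")[-1])
        return Counter(policies)

    print(cache_policy_counts(CHECK_OUTPUT))  # Counter({'WriteThrough': 13})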