[00:08:30] (PS3) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[00:09:38] (CR) CI reject: [V: -1] SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[00:13:54] (Merged) jenkins-bot: SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[00:18:48] (CR) Tim Starling: Factor out x2 per-host hieradata into an objectstash role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[00:25:23] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix modtoken comparison T315271 (duration: 03m 45s)
[00:25:27] T315271: db1151, db2144 X2 masters error: Could not execute Delete_rows_v1 event on table mainstash.objectstash - https://phabricator.wikimedia.org/T315271
[00:26:16] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:28:26] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:30:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:34:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:34:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:38:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:00:40] (CR) Andrew Bogott: [C: +2] Openstack Designate codfw1dev to Xena [puppet] - https://gerrit.wikimedia.org/r/824885 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[01:05:42] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:05:54] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:34] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:45] (JobUnavailable) firing: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:15:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:45] (JobUnavailable) resolved: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (6) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:26] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:12] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:22] RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1150) taken on 2022-08-22 01:14:01 (512 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:32] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:12] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:18] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:23:50] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:54] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:39:42] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:45:00] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:18] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:08:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:14] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:15:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:22] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:26:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:40] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:54] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:36:58] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:28] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:50] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:01:34] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:02:20] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:07:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:26:39] SRE, Infrastructure-Foundations, Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (Ladsgroup) Migrating it to mailman3 would help if the volume is not too large. cc. @Ottomata
[05:27:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (Ladsgroup) In progress→Resolved a:cmooney Given that there is no answer, I close this. Please reopen if you can't access.
[05:29:31] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (Ladsgroup) p:Triage→Medium (don't mind me, SRE clinic duty)
[05:32:06] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:35:08] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:37:36] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (Ladsgroup) In progress→Resolved a:cmooney Since there hasn't been any response. I close it, reopen if you have trouble accessing.
[05:38:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Ladsgroup) In progress→Resolved a:odimitrijevic→cmooney Since there hasn't been any response. I close it, reopen if you have trouble acc...
[05:52:06] (PS1) Marostegui: instances.yaml: Add db2178 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825072 (https://phabricator.wikimedia.org/T311494)
[05:53:01] (CR) Marostegui: [C: +2] instances.yaml: Add db2178 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825072 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2178 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32651 and previous config saved to /var/cache/conftool/dbconfig/20220822-055446-marostegui.json
[05:54:51] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[05:55:33] (PS1) Marostegui: db2178: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825073 (https://phabricator.wikimedia.org/T311494)
[06:04:24] (CR) Marostegui: [C: +2] db2178: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825073 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:09:10] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:10:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:10:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:11:18] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:20] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:32] (PS1) Marostegui: db2179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825075 (https://phabricator.wikimedia.org/T311494)
[06:12:28] (CR) Marostegui: [C: +2] db2179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825075 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:13:50] (PS1) Marostegui: instances.yaml: Add db2179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825076 (https://phabricator.wikimedia.org/T311494)
[06:14:52] (CR) Marostegui: [C: +2] instances.yaml: Add db2179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825076 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:15:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:15:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:15:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:15:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:15:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2179 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32652 and previous config saved to /var/cache/conftool/dbconfig/20220822-061553-marostegui.json
[06:15:57] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[06:16:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32653 and previous config saved to /var/cache/conftool/dbconfig/20220822-061600-marostegui.json
[06:16:04] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32654 and previous config saved to /var/cache/conftool/dbconfig/20220822-062246-marostegui.json
[06:22:51] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:25:12] (CR) Muehlenhoff: "Looks good, one typo inline" [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[06:27:40] SRE, Infrastructure-Foundations, observability: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (MoritzMuehlenhoff) It's not broken, it's just not yet implemented :-) https://gerrit.wikimedia.org/r/c/operations/puppet/+/812250 is the main patch, but it f...
[06:28:04] (CR) Marostegui: [C: +2] Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[06:31:21] (PS1) Marostegui: db2180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825230 (https://phabricator.wikimedia.org/T311494)
[06:32:09] (CR) Marostegui: [C: +2] db2180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825230 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:33:20] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:33:43] (PS1) Marostegui: db2180: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/825231 (https://phabricator.wikimedia.org/T311494)
[06:34:26] (CR) Marostegui: [C: +2] db2180: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/825231 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:35:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2180 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32655 and previous config saved to /var/cache/conftool/dbconfig/20220822-063533-marostegui.json
[06:35:38] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[06:37:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32656 and previous config saved to /var/cache/conftool/dbconfig/20220822-063752-marostegui.json
[06:38:22] !log Install 10.4.26 on db1119, db1142, db1096 T315411
[06:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:26] T315411: Compile and package MariaDB 10.6.9 and 10.4.26 - https://phabricator.wikimedia.org/T315411
[06:38:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 db1142 db1096', diff saved to https://phabricator.wikimedia.org/P32657 and previous config saved to /var/cache/conftool/dbconfig/20220822-063857-root.json
[06:39:43] (PS7) Ori: Incremental roll-out of query-sorting (0%) [puppet] - https://gerrit.wikimedia.org/r/822434 (https://phabricator.wikimedia.org/T314868)
[06:39:45] (PS1) Ori: Incremental roll-out of query-sorting (1%) [puppet] - https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868)
[06:43:58] Amir1, urbanecm: I have a patch for this upcoming window, but it's the seventh and the max is six. I'm also going to be a few minutes late, need to get to a different location. If you're up for it, ping me when you're done with the other patches in the window, but if not that's OK also.
[06:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32658 and previous config saved to /var/cache/conftool/dbconfig/20220822-064418-root.json
[06:44:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32659 and previous config saved to /var/cache/conftool/dbconfig/20220822-064424-root.json
[06:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32660 and previous config saved to /var/cache/conftool/dbconfig/20220822-064448-root.json
[06:44:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32661 and previous config saved to /var/cache/conftool/dbconfig/20220822-064457-root.json
[06:45:54] ori: I suggest letting the window finish and then let's do it either together or you self-serve
[06:46:06] (basically what you said :D)
[06:46:26] ack, sg.
[06:48:55] (PS1) Marostegui: db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540)
[06:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32662 and previous config saved to /var/cache/conftool/dbconfig/20220822-065258-marostegui.json
[06:54:30] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32663 and previous config saved to /var/cache/conftool/dbconfig/20220822-065923-root.json
[06:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P32664 and previous config saved to /var/cache/conftool/dbconfig/20220822-065929-root.json
[06:59:42] ori: I think you're looking at the calendar for a week ago, the upcoming window is empty
[06:59:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32665 and previous config saved to /var/cache/conftool/dbconfig/20220822-065953-root.json
[07:00:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32666 and previous config saved to /var/cache/conftool/dbconfig/20220822-070001-root.json
[07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T0700). Please do the needful.
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:20] haha
[07:00:25] ori: have fun
[07:01:42] (CR) Ladsgroup: [C: +1] db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: Marostegui)
[07:05:17] (PS1) Marostegui: db2181: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825235 (https://phabricator.wikimedia.org/T311494)
[07:07:00] (CR) Marostegui: [C: +2] db2181: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825235 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:08:01] (CR) Muehlenhoff: P:systemd::timesyncd: allow overriding the protectsystem systemd param (3 comments) [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[07:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32667 and previous config saved to /var/cache/conftool/dbconfig/20220822-070804-marostegui.json
[07:08:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:08:10] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:08:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:08:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 7 hosts with reason: Maintenance
[07:08:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 7 hosts with reason: Maintenance
[07:09:00] (PS1) Marostegui: instances.yaml: Add db2181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825236 (https://phabricator.wikimedia.org/T311494)
[07:10:24] (CR) Marostegui: [C: +2] instances.yaml: Add db2181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825236 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:11:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2181 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32668 and previous config saved to /var/cache/conftool/dbconfig/20220822-071153-marostegui.json
[07:11:58] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[07:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32669 and previous config saved to /var/cache/conftool/dbconfig/20220822-071427-root.json
[07:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32670 and previous config saved to /var/cache/conftool/dbconfig/20220822-071433-root.json
[07:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32671 and previous config saved to /var/cache/conftool/dbconfig/20220822-071458-root.json
[07:15:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32672 and previous config saved to /var/cache/conftool/dbconfig/20220822-071506-root.json
[07:20:25] (PS1) Muehlenhoff: Remove access for dpifke [puppet] - https://gerrit.wikimedia.org/r/825238
[07:23:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:23:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32673 and previous config saved to /var/cache/conftool/dbconfig/20220822-072339-marostegui.json
[07:23:43] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:24:30] (CR) Muehlenhoff: [C: +2] Remove access for dpifke [puppet] - https://gerrit.wikimedia.org/r/825238 (owner: Muehlenhoff)
[07:26:09] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:29:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32674 and previous config saved to /var/cache/conftool/dbconfig/20220822-072932-root.json
[07:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32675 and previous config saved to /var/cache/conftool/dbconfig/20220822-072938-root.json
[07:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32676 and previous config saved to /var/cache/conftool/dbconfig/20220822-073002-root.json
[07:30:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32677 and previous config saved to /var/cache/conftool/dbconfig/20220822-073010-root.json
[07:32:16] ori: Amir1: the limit is merely an arbitrary suggestion. I guess at some point we found out 6 patches would fit in a one hour window
[07:33:02] similar to the no deploy fridays ;D
[07:38:38] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) {P32679}
[07:39:29] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) {F35483565}
[07:44:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32681 and previous config saved to /var/cache/conftool/dbconfig/20220822-074437-root.json
[07:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P32682 and previous config saved to /var/cache/conftool/dbconfig/20220822-074443-root.json
[07:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32683 and previous config saved to /var/cache/conftool/dbconfig/20220822-074507-root.json
[07:45:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32684 and previous config saved to /var/cache/conftool/dbconfig/20220822-074515-root.json
[07:45:55] ori: personally, I don’t have an issue with going over six patches, as long as there is time for everything.
[07:47:12] Thanks
[07:49:21] (PS1) Marostegui: db2182: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825243 (https://phabricator.wikimedia.org/T311494)
[07:50:20] (CR) Marostegui: [C: +2] db2182: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825243 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:51:38] (PS1) Marostegui: instances.yaml: Add db2182 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825244 (https://phabricator.wikimedia.org/T311494)
[07:52:27] (CR) Marostegui: [C: +2] instances.yaml: Add db2182 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825244 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:54:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2182 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32685 and previous config saved to /var/cache/conftool/dbconfig/20220822-075359-marostegui.json
[07:54:04] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[07:54:37] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (Marostegui)
[07:59:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32686 and previous config saved to /var/cache/conftool/dbconfig/20220822-075941-root.json
[07:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32687 and previous config saved to /var/cache/conftool/dbconfig/20220822-075949-root.json
[08:00:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32688 and previous config saved to /var/cache/conftool/dbconfig/20220822-080012-root.json
[08:00:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32689 and previous config saved to /var/cache/conftool/dbconfig/20220822-080020-root.json
[08:04:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32690 and previous config saved to /var/cache/conftool/dbconfig/20220822-080424-marostegui.json
[08:04:29] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:05:37] (CR) Wangombe: [C: +1] TranslatableBundleLogFormatter: Cast reason to string before passing it [extensions/Translate] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) (owner: Jforrester)
[08:05:45] (CR) Marostegui: [C: +2] db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: Marostegui)
[08:06:24] (CR) Hashar: [C: -1] "Needs a rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[08:06:35] (CR) DCausse: [C: +1] "Thanks for the cleanups!" [mediawiki-config] - https://gerrit.wikimedia.org/r/820398 (owner: Lucas Werkmeister (WMDE))
[08:07:00] (Merged) jenkins-bot: db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: Marostegui)
[08:07:14] (PS4) Ladsgroup: Add variables regulating the php 7.4 transition [mediawiki-config] - https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[08:08:12] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:08:22] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:43] (PS5) Giuseppe Lavagetto: Add variables regulating the php 7.4 transition [mediawiki-config] - https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736)
[08:10:45] (PS4) Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736)
[08:10:47] (PS4) Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736)
[08:10:49] (PS4) Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736)
[08:10:51] (PS4) Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736)
[08:10:53] (PS4) Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736)
[08:10:55] (PS4) Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] -
10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) [08:10:57] (03PS4) 10Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) [08:11:17] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es4 T315540 (duration: 03m 35s) [08:11:22] T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540 [08:12:05] (03CR) 10Giuseppe Lavagetto: Add variables regulating the php 7.4 transition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:12:57] (03PS1) 10Marostegui: mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540) [08:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:13:54] (03PS1) 10Marostegui: wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540) [08:14:32] (03PS1) 10Marostegui: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 [08:14:36] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui) [08:14:39] (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui) [08:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32691 and previous config saved to /var/cache/conftool/dbconfig/20220822-081453-root.json [08:14:57] (03CR) 10Filippo Giunchedi: "Thank you for investigating this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [08:15:03] (03CR) 10Hashar: [C: 03+1] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:15:05] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui) [08:15:24] (03CR) 10Ladsgroup: [C: 03+1] Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui) [08:15:32] _joe_: should I deploy your patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823674/ ? [08:15:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes202[34] implementation tracking - https://phabricator.wikimedia.org/T313871 (10JMeybohm) [08:15:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [08:15:52] <_joe_> hashar: I can deploy it myself, just need a +1 [08:15:56] done [08:16:10] <_joe_> <3 [08:16:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Prod-Kubernetes, 10serviceops: kubernetes202[34] implementation tracking - https://phabricator.wikimedia.org/T313871 (10JMeybohm) [08:16:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:17:55] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es4 T315540 [08:17:59] T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540 [08:18:01] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es4 T315540 [08:18:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:18:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1021 with weight 10 T315540', diff saved to https://phabricator.wikimedia.org/P32692 and previous config saved to /var/cache/conftool/dbconfig/20220822-081817-root.json [08:19:03] (03Merged) 10jenkins-bot: Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:19:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P32693 and previous config saved to /var/cache/conftool/dbconfig/20220822-081930-marostegui.json [08:19:50] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - Failed to connect to bus: 
Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:20:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui) [08:21:19] !log Starting es4 eqiad failover from es1020 to es1021 - T315540 [08:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1021 to es4 primary T315540', diff saved to https://phabricator.wikimedia.org/P32694 and previous config saved to /var/cache/conftool/dbconfig/20220822-082208-root.json [08:23:14] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui) [08:23:28] PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:58] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:25:12] (03CR) 10Marostegui: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui) [08:25:14] (03CR) 10Marostegui: [C: 03+2] Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui) [08:25:36] (03CR) 10Filippo Giunchedi: WIP dispatch: add database role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [08:25:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:26:09] 
(03Merged) 10jenkins-bot: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui) [08:26:30] sigh... the 3 new wdqs nodes wdqs1014, wdqs1015 and wdqs1016 seem to be falling apart, can't connect to them, is there anyone available to have a quick look at them? [08:29:00] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Introducing variables for php 7.4 migration (duration: 03m 39s) [08:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32695 and previous config saved to /var/cache/conftool/dbconfig/20220822-082958-root.json [08:30:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:30:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:31:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:31:34] dcausse: I can't even log in via the serial console, on wdqs1014 I can only see a wdqs-categories spewing 100s of lines of error messages per second. Shall I just powercycle? [08:31:55] (I get a serial console, but no tty login is possible) [08:32:18] moritzm: thanks for looking! 
yes please :) [08:32:23] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es4 T315540 (duration: 03m 17s) [08:32:27] T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540 [08:32:48] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 for reboot T310485', diff saved to https://phabricator.wikimedia.org/P32696 and previous config saved to /var/cache/conftool/dbconfig/20220822-083341-root.json [08:33:52] !log powercycling wdqs1014 (unresponsive via botched wdqs-categories process [08:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P32697 and previous config saved to /var/cache/conftool/dbconfig/20220822-083436-marostegui.json [08:34:58] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:35:10] PROBLEM - Host wdqs1014 is DOWN: PING CRITICAL - Packet loss = 100% [08:36:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:36:36] RECOVERY - Host wdqs1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [08:37:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:37:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:37:22] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:30] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:40:01] dcausse: same issue on wdqs1015, also going to powercycle it [08:40:54] although, actually I can log in just fine (although the console keeps getting spammed in the same manner as 1014) [08:41:39] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Ladsgroup) There is no checklist here; is there anything left before closing this ticket? [08:42:36] dcausse: same thing for 1016 [08:42:48] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert No, we're done I think. Closing. 
[08:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32698 and previous config saved to /var/cache/conftool/dbconfig/20220822-084335-root.json [08:43:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 ', diff saved to https://phabricator.wikimedia.org/P32699 and previous config saved to /var/cache/conftool/dbconfig/20220822-084359-root.json [08:46:58] RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:06] moritzm: thanks! [08:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32700 and previous config saved to /var/cache/conftool/dbconfig/20220822-084800-root.json [08:49:24] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32701 and previous config saved to /var/cache/conftool/dbconfig/20220822-084942-marostegui.json [08:49:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [08:49:47] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:50:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [08:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32702 and previous config saved to /var/cache/conftool/dbconfig/20220822-085014-marostegui.json [08:50:32] ACKNOWLEDGEMENT - 
MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Acknowledged: T315850 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:54:16] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:46] (03CR) 10JMeybohm: [C: 03+1] "Feel free to ignore the naming nit. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [08:56:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32703 and previous config saved to /var/cache/conftool/dbconfig/20220822-085654-marostegui.json [08:57:01] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:57:15] (03PS1) 10Muehlenhoff: Remove now redundant group [puppet] - 10https://gerrit.wikimedia.org/r/825250 [08:59:11] (03PS2) 10Btullis: Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) [09:02:59] (03PS4) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) [09:03:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32704 and previous config saved to /var/cache/conftool/dbconfig/20220822-090305-root.json [09:10:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - 
https://phabricator.wikimedia.org/T306928 (10Marostegui) Any ETA for getting db1187 and db1185 online? [09:11:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) [09:11:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/824776 (owner: 10Eevans) [09:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P32705 and previous config saved to /var/cache/conftool/dbconfig/20220822-091200-marostegui.json [09:12:09] (03PS2) 10Tim Starling: Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) [09:13:44] (03PS4) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) [09:13:46] (03PS4) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [09:13:48] (03PS1) 10Filippo Giunchedi: wmflib: introduce pythonloglevel type [puppet] - 10https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) [09:14:58] (03CR) 10Filippo Giunchedi: WIP: add profile::dispatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [09:17:40] (03PS1) 10Marostegui: mariadb: Productionize db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825256 (https://phabricator.wikimedia.org/T315856) [09:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32706 and previous config saved to /var/cache/conftool/dbconfig/20220822-091810-root.json [09:18:24] (03CR) 10JMeybohm: [C: 03+1] Add a 
new signing profile for the dse_k8s cfssl-issuer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:19:32] (03CR) 10Btullis: [C: 03+2] Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:20:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (+1ing my own patch due to followup)" [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [09:22:14] (03CR) 10JMeybohm: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:24:10] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:24:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825256 (https://phabricator.wikimedia.org/T315856) (owner: 10Marostegui) [09:24:42] (03CR) 10Matthias Mullie: [C: 03+1] "No reservations from us - thanks for this cleanup!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE)) [09:24:46] dbproxy alerts are expected [09:25:43] (03CR) 10Btullis: [C: 03+2] Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:25:54] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P32708 and previous config saved to /var/cache/conftool/dbconfig/20220822-092706-marostegui.json [09:27:17] (03PS2) 10Ori: Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) [09:27:23] any objections to me deploying a config patch? 
[09:27:55] (03PS1) 10Marostegui: install_server: Do not reimage db218* [puppet] - 10https://gerrit.wikimedia.org/r/825259 [09:28:27] (03PS2) 10Jbond: bullseye: apt component update [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [09:28:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [09:28:57] (03PS2) 10Gergő Tisza: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [09:29:02] (03Merged) 10jenkins-bot: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:29:09] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db218* [puppet] - 10https://gerrit.wikimedia.org/r/825259 (owner: 10Marostegui) [09:30:15] (03PS2) 10Btullis: Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - 10https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196) [09:30:37] jouncebot: nowandnext [09:30:37] No deployments scheduled for the next 3 hour(s) and 29 minute(s) [09:30:38] In 3 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300) [09:32:40] (03CR) 10JMeybohm: [C: 03+1] Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:33:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 8%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32709 and previous config saved to /var/cache/conftool/dbconfig/20220822-093314-root.json [09:34:17] * ori goes for it 
[09:34:23] (03CR) 10Ori: [C: 03+2] Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:35:09] (03Merged) 10jenkins-bot: Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:36:28] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [09:38:03] !log push new policy on pfw3-eqiad - T315578 [09:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:39:44] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [09:41:28] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [09:41:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:41:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:42:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32710 and previous config saved to /var/cache/conftool/dbconfig/20220822-094213-marostegui.json [09:42:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:42:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:42:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T312972)', diff saved to 
https://phabricator.wikimedia.org/P32711 and previous config saved to /var/cache/conftool/dbconfig/20220822-094234-marostegui.json [09:42:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:43:22] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:44:37] (03PS5) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [09:44:58] !log ori@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I5ea1b1286: Set $wgCdnMatchParameterOrder to false by default (T314868) (duration: 03m 31s) [09:45:26] (03CR) 10Marostegui: [C: 03+1] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling) [09:48:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32712 and previous config saved to /var/cache/conftool/dbconfig/20220822-094819-root.json [09:48:20] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [09:48:45] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet [09:51:11] (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:52:15] (03CR) 10Ayounsi: Bump pynetbox to ~= 6.6 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [09:52:25] (03PS1) 10Jbond: C:scap: use in clude scap vs require [puppet] - 10https://gerrit.wikimedia.org/r/825262 [09:52:56] (03PS3) 10Filippo 
Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) [09:52:58] (03PS3) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [09:53:18] (03CR) 10Filippo Giunchedi: sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:55:21] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [09:57:09] (03PS2) 10Vgutierrez: Incremental roll-out of query-sorting (1%) [puppet] - 10https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:58:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet [09:58:58] (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:59:18] (03CR) 10Jbond: [C: 03+2] C:scap: use in clude scap vs require [puppet] - 10https://gerrit.wikimedia.org/r/825262 (owner: 10Jbond) [09:59:31] (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (1%) [puppet] - 10https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:59:52] (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: drop managehome parameter as it dosn;t work (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [10:00:17] jbond: go ahead if ori's change is in your puppet-merge session :) [10:00:32] hmm nevermind [10:00:40] (merged) [10:00:53] !log Incremental 
roll-out of query-sorting (1%) - T314868 [10:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:58] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [10:01:00] ori: ^^ [10:02:02] woot [10:03:13] what's your t-shirt size ori? just in case ;P [10:03:18] haha [10:03:20] L [10:03:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 20%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32714 and previous config saved to /var/cache/conftool/dbconfig/20220822-100324-root.json [10:03:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:04:47] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) I see that the subtask got resolved, nice! Please run the new/additional cables without connecting them. Once done... 
[10:05:23] (03PS1) 10Jbond: hieradata: enable systemd user on phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/825289 (https://phabricator.wikimedia.org/T315568) [10:05:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] hieradata: enable systemd user on phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/825289 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [10:07:31] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10jbond) i have re-enabled systemd::sysuser on phab2002 and things seem to be working, let me know if there is still an issue [10:08:14] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Set transaction_active_timeout_out on cp4026 and cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [10:10:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/825250 (owner: 10Muehlenhoff) [10:11:24] (03PS1) 10Marostegui: site.pp: Remove insetup from db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825291 (https://phabricator.wikimedia.org/T313569) [10:18:21] (03CR) 10Klausman: [C: 03+1] Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:18:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 30%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32715 and previous config saved to /var/cache/conftool/dbconfig/20220822-101828-root.json [10:19:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:21:00] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:48] (03CR) 10Jbond: "see comment on naming. We can also replace the following" [puppet] - 10https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:25:58] (03CR) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:28:08] (03PS1) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [10:29:36] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:30:38] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10JayCano) Hi @cmooney. I can confirm that Tšepo requires this level of access for some work that we are going to do. Thank you. 
[10:30:47] (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [10:33:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 40%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32716 and previous config saved to /var/cache/conftool/dbconfig/20220822-103333-root.json [10:34:10] 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10MoritzMuehlenhoff) [10:35:11] (03CR) 10Majavah: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [10:35:38] 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10MoritzMuehlenhoff) [10:35:42] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (10MoritzMuehlenhoff) [10:36:08] PROBLEM - Check systemd state on kubernetes1016 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312972)', diff saved to https://phabricator.wikimedia.org/P32717 and previous config saved to /var/cache/conftool/dbconfig/20220822-104249-marostegui.json [10:42:54] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [10:43:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove now redundant group [puppet] - 10https://gerrit.wikimedia.org/r/825250 (owner: 10Muehlenhoff) [10:47:58] (KubernetesRsyslogDown) firing: rsyslog on dse-k8s-ctrl1001:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:48:10] PROBLEM - Check systemd state on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32718 and previous config saved to /var/cache/conftool/dbconfig/20220822-104838-root.json [10:49:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:58] (KubernetesRsyslogDown) resolved: rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:53:58] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:54:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P32719 and previous config saved to 
/var/cache/conftool/dbconfig/20220822-105755-marostegui.json [11:00:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:01:10] (03CR) 10Jbond: [V: 03+1] O:prometheus: use map instead of reduce (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [11:01:19] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:20] (03CR) 10Jbond: [V: 03+1] O:prometheus: use map instead of reduce (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [11:03:23] kubernetes-eqiad BGP errors is me (should be temporary) [11:03:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 60%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32720 and previous config saved to /var/cache/conftool/dbconfig/20220822-110342-root.json [11:04:17] PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:39] (03PS9) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:12:17] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:13:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P32721 and previous config saved to /var/cache/conftool/dbconfig/20220822-111301-marostegui.json [11:14:54] (03PS1) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) [11:16:05] (03CR) 10CI reject: [V: 04-1] es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [11:16:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [11:17:57] RECOVERY - Check systemd state on dse-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:25] (03PS2) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [11:18:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32722 and previous config saved to /var/cache/conftool/dbconfig/20220822-111847-root.json [11:20:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:25] (03PS10) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:22:11] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:22:24] 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Aklapper) Hi, please follow the docs at https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process which links to a Phabricator template - thanks! [11:23:30] (03PS2) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) [11:24:07] (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:24:36] (03CR) 10CI reject: [V: 04-1] es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [11:24:58] (03CR) 10Btullis: [C: 03+2] Add etcd data for dse-k8s kubeserver-api backend selection. [puppet] - 10https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [11:25:24] (03PS11) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:25:47] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dse-k8s-ctrl1001.eqiad.wmnet [11:27:20] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36867/console" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:27:47] RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312972)', diff saved to https://phabricator.wikimedia.org/P32723 and previous config saved to /var/cache/conftool/dbconfig/20220822-112808-marostegui.json [11:28:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:28:12] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [11:28:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:28:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32724 and previous config saved to /var/cache/conftool/dbconfig/20220822-112829-marostegui.json [11:31:51] (03PS3) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) [11:32:39] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:42] 
(03PS12) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) [11:32:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:33:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [11:33:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32725 and previous config saved to /var/cache/conftool/dbconfig/20220822-113352-root.json [11:33:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36868/console" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:34:17] RECOVERY - Check systemd state on kubernetes1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:53] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:57] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:35:15] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:36:17] !log installing libdatetime-timezone-perl updates from SUA update [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:35] (03PS3) 10Jbond: C:admin: when creating users make sure we add a dependency on the shell package [puppet] - 
10https://gerrit.wikimedia.org/r/824164 [11:36:41] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:38:15] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for aline_bruenger_WMDE - https://phabricator.wikimedia.org/T315865 (10Aline_Bruenger_WMDE) [11:38:25] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10Ladsgroup) a:05cmooney→03Ladsgroup Taking over as I'm on clinic duty this week. This also needs approval from @Ottomata or @odimitrijevi... [11:38:42] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825291 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [11:38:45] RECOVERY - puppet last run on netboxdb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:39:25] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Aline_Bruenger_WMDE) @Aklapper , I edited my initial request according to the template. 
[11:39:35] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) [11:40:27] (03PS1) 10Marostegui: db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) [11:41:53] (03PS1) 10Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) [11:42:03] (03PS1) 10Marostegui: mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542) [11:42:07] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) While I check the notes in the checklist, this needs approval from your manager (Lea?) and analytics approval (@odimitrijevic or @Ottomata) [11:43:08] (03PS1) 10Marostegui: wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542) [11:44:19] (03PS1) 10Jbond: C:prometheus::ipmi_exporter: only listen on primary address [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834) [11:44:46] (03CR) 10Slyngshede: [V: 03+1] "Fixed two comments from Daniel." 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:45:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36869/console" [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834) (owner: 10Jbond) [11:45:20] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10Ladsgroup) [11:46:02] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [11:46:14] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [11:46:31] (03CR) 10Ladsgroup: [C: 03+1] db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [11:46:39] (03CR) 10Marostegui: [C: 03+2] db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [11:47:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es5 T315542 [11:47:27] T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542 [11:47:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es5 T315542 [11:47:47] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [11:51:13] !log marostegui@deploy1002 Synchronized 
wmf-config/db-production.php: Disable writes on es5 T315542 (duration: 03m 08s) [11:53:08] (03PS1) 10Muehlenhoff: vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333 [11:54:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:56:36] could someone with op temporarily remove the "SREs on call ..." part from the channel topic please? It's currently not kept up to date automatically [11:57:28] (03PS1) 10Stang: trwikiquote: Enable block feature of abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825334 (https://phabricator.wikimedia.org/T315736) [11:58:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:58:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:00:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1024 with weight 10 T315542', diff saved to https://phabricator.wikimedia.org/P32726 and previous config saved to /var/cache/conftool/dbconfig/20220822-120141-root.json [12:01:45] T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542 [12:02:21] PROBLEM - puppet last run on registry1004 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:02:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:prometheus::ipmi_exporter: only listen on primary address [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834) (owner: 10Jbond) [12:04:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [12:05:33] !log Starting es5 eqiad failover from es1023 to es1024 - T315542 [12:05:37] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1024 to es5 primary T315542', diff saved to https://phabricator.wikimedia.org/P32727 and previous config saved to /var/cache/conftool/dbconfig/20220822-120611-root.json [12:06:49] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274 [12:07:04] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui) [12:07:07] (03CR) 10Ladsgroup: [C: 03+1] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274 (owner: 10Marostegui) [12:07:31] (03PS1) 10KartikMistry: Update cxserver to 2022-08-22-093815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/825336 (https://phabricator.wikimedia.org/T308248) [12:08:41] RECOVERY - puppet last run on registry1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:04] (03CR) 10Marostegui: [C: 03+2] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274 (owner: 10Marostegui) [12:09:53] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274 (owner: 10Marostegui) [12:11:44] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) p:05Triage→03Medium [12:12:21] (03PS2) 10Jbond: P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) [12:13:23] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on 
es5 T315542 (duration: 03m 18s) [12:13:28] T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542 [12:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1023 for reboot T315542', diff saved to https://phabricator.wikimedia.org/P32728 and previous config saved to /var/cache/conftool/dbconfig/20220822-121401-root.json [12:15:28] (03PS3) 10Jbond: P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) [12:16:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:16:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36871/console" [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [12:16:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:16:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:16:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:17:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:20:06] !log fix up network config for ldap-replica2006 T273026 [12:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:10] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [12:20:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2006.wikimedia.org [12:20:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Siko_WMDE) [12:20:56] !log kubernetes1016:~$ sudo systemctl 
reset-failed ifup@ens13.service - T273026 [12:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:15] RECOVERY - Host ldap-replica2006 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [12:21:57] (03CR) 10Jbond: C:admin: when creating users make sure we add a dependency on the shell package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824164 (owner: 10Jbond) [12:22:05] (03CR) 10Jbond: [C: 03+2] C:admin: when creating users make sure we add a dependency on the shell package [puppet] - 10https://gerrit.wikimedia.org/r/824164 (owner: 10Jbond) [12:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32729 and previous config saved to /var/cache/conftool/dbconfig/20220822-122214-marostegui.json [12:22:19] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:26:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2006.wikimedia.org [12:28:47] (03PS4) 10Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) [12:30:05] (03PS5) 10Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) [12:31:27] (03CR) 10JMeybohm: [C: 04-1] "PTR is missing" [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [12:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32730 and previous config saved to /var/cache/conftool/dbconfig/20220822-123135-root.json [12:31:45] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36872/console" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:33:11] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet [12:36:18] (03CR) 10Slyngshede: [V: 03+1] c:spamassassin move Spamassassin updates from crontab to systemd timers. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:37:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P32731 and previous config saved to /var/cache/conftool/dbconfig/20220822-123720-marostegui.json [12:39:37] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [12:45:19] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [12:46:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:46:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:46:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32732 and previous config saved to /var/cache/conftool/dbconfig/20220822-124640-root.json [12:48:04] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [12:49:42] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:57] (03PS1) 10Ladsgroup: SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693) [12:51:45] jouncebot: nowandnext 
[12:51:45] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [12:51:45] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300) [12:52:02] it's empty, +2ing my patch [12:52:13] (03CR) 10Ladsgroup: [C: 03+2] SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693) (owner: 10Ladsgroup) [12:52:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet [12:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P32734 and previous config saved to /var/cache/conftool/dbconfig/20220822-125226-marostegui.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:18] * urbanecm waves
[13:01:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32735 and previous config saved to /var/cache/conftool/dbconfig/20220822-130144-root.json
[13:02:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (Ladsgroup) p:Triage→Medium a:Ladsgroup
[13:03:09] SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (Ladsgroup) a:Ladsgroup
[13:03:19] !log disabled backup scheduling for backup1002, backup2002 T315864
[13:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:23] T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864
[13:04:18] SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (WMDE-leszek) I approve this request on WMDE's behalf.
[13:04:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (WMDE-leszek) I approve this request on WMDE's behalf.
[13:05:07] (PS2) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:07:09] (PS3) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32737 and previous config saved to /var/cache/conftool/dbconfig/20220822-130732-marostegui.json
[13:07:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[13:07:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[13:07:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[13:08:23] (Merged) jenkins-bot: SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693) (owner: Ladsgroup)
[13:09:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:09:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:12:02] (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36873/console" [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[13:13:11] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/includes: Backport: [[gerrit:825276|SiteStats: Make sure initSiteStats.php re-distribute values (T315693)]] (duration: 03m 32s)
[13:13:15] T315693: Inflated counts in site statistics
- https://phabricator.wikimedia.org/T315693
[13:13:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:14:44] (PS4) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:14:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:14:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:14:49] (CR) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[13:15:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:16:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 8%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32738 and previous config saved to /var/cache/conftool/dbconfig/20220822-131649-root.json
[13:17:50] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[13:20:30] (PS1) Clément Goubert: vopsbot: join #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/825346
[13:20:41] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Ottomata) > running monthly data dump script for similarusers It isn't clear that analytics-privatedata-users is the right group for this....
[13:21:00] (CR) Alexandros Kosiaris: [C: +1] "Backlog seems to be gone per https://phabricator.wikimedia.org/T300914#8174044 but in any case, this shouldn't hurt" [deployment-charts] - https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[13:21:14] SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (Ottomata) Approved!
[13:21:52] (CR) Herron: WIP dispatch: add database role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:22:08] (CR) Btullis: "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:22:17] (CR) Btullis: [C: +1] P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:24:10] (CR) Clément Goubert: "Just in case we want sirenbot in operations too, but I don't know if it's ready for prime-time yet ;)" [puppet] - https://gerrit.wikimedia.org/r/825346 (owner: Clément Goubert)
[13:25:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet
[13:25:10] (PS1) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:25:47] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:26:21] (CR) CI reject: [V: -1] admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:26:43] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36874/console" [puppet] - https://gerrit.wikimedia.org/r/825346 (owner: Clément Goubert)
[13:28:09] !log marostegui@cumin1001 dbctl commit
(dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P32740 and previous config saved to /var/cache/conftool/dbconfig/20220822-132808-root.json
[13:28:58] (KubernetesRsyslogDown) firing: rsyslog on kubemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:30:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32741 and previous config saved to /var/cache/conftool/dbconfig/20220822-133021-root.json
[13:30:35] (CR) Hokwelum: "We noticed two files weren’t updated, the production-m3.sql.erb template file, and hieradata/common.yaml file. They both have some referen" [puppet] - https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[13:31:33] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:31:33] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[13:31:47] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:31:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32742 and previous config saved to /var/cache/conftool/dbconfig/20220822-133154-root.json
[13:32:54] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:33:27] (PS2) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:33:57] (CR) Filippo Giunchedi: O:prometheus: use map instead of reduce (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: Jbond)
[13:33:58] (KubernetesRsyslogDown) resolved: rsyslog
on kubemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:34:14] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36876/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:35:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:35:40] SRE, Data Engineering Planning, Data-Engineering-Operations, Infrastructure-Foundations, Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (Ottomata)
[13:37:26] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:37:47] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890
[13:37:51] T314890: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890
[13:37:58] (PS3) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:38:00] (PS1) Btullis: Add an extry for dse-k8s-ctrl to the service catalog [puppet] - https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172)
[13:38:02] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890
[13:38:24] (PS2) Btullis: Add an entry for dse-k8s-ctrl to the service catalog [puppet] - https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172)
[13:38:47] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36877/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:39:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet
[13:39:11] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:39:41] SRE, Analytics-Radar, Traffic-Icebox, Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (Ottomata) a:Ottomata→None
[13:41:37] (PS4) Jbond: admin: Correct spelling additional_shells [puppet] - https://gerrit.wikimedia.org/r/825347
[13:42:19] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36878/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:42:47] (CR) Jbond: [V: +1 C: +2] admin: Correct spelling additional_shells [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:44:51] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32743 and previous config saved to /var/cache/conftool/dbconfig/20220822-134526-root.json
[13:46:19] (CR) Jbond: [V: +1 C: +2] P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:46:40] (CR) Jbond: [V: +1 C: +2] P:systemd::timesyncd: exclude /mnt from accessible paths (2 comments) [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:46:55] (CR) Filippo Giunchedi: "LGTM overall, thanks
Amir for tackling this!" [alerts] - https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: Ladsgroup)
[13:46:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 20%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32744 and previous config saved to /var/cache/conftool/dbconfig/20220822-134658-root.json
[13:48:18] (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: Ladsgroup)
[13:48:39] (PS1) Vgutierrez: trafficserver: Log cache read|write attempts [puppet] - https://gerrit.wikimedia.org/r/825350
[13:48:47] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs[1014-1016].eqiad.wmnet
[13:48:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs[1014-1016].eqiad.wmnet
[13:49:20] (PS2) Muehlenhoff: raid_fact: Add new refactored raid fact [puppet] - https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: Jbond)
[13:50:50] SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) I had a good discussion with @jbond on irc about how we model the host interfaces in Netbox, and I think bas...
[13:53:16] (CR) Filippo Giunchedi: wmflib: introduce pythonloglevel type (1 comment) [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:53:21] (PS2) Filippo Giunchedi: wmflib: introduce pythonloglevel type [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229)
[13:53:23] (PS5) Filippo Giunchedi: WIP dispatch: add database role [puppet] - https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229)
[13:53:25] (PS5) Filippo Giunchedi: WIP: add profile::dispatch [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[13:59:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:00:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32745 and previous config saved to /var/cache/conftool/dbconfig/20220822-140030-root.json
[14:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 30%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32746 and previous config saved to /var/cache/conftool/dbconfig/20220822-140203-root.json
[14:07:35] (PS2) Eevans: eevans: replace 2048bit RSA key with new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/824776
[14:15:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32747 and previous config saved to /var/cache/conftool/dbconfig/20220822-141535-root.json
[14:16:45] (CR) Eevans: [C: +2] eevans: replace 2048bit RSA key with new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/824776 (owner: Eevans)
[14:17:08] !log
marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 40%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32748 and previous config saved to /var/cache/conftool/dbconfig/20220822-141708-root.json
[14:22:32] (PS1) Marostegui: mariadb: Promote db2142 to x2 master [puppet] - https://gerrit.wikimedia.org/r/825354 (https://phabricator.wikimedia.org/T315853)
[14:22:44] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[14:22:48] T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[14:22:49] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[14:23:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2142 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P32749 and previous config saved to /var/cache/conftool/dbconfig/20220822-142312-marostegui.json
[14:23:14] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:24:04] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:23] !log Starting x2 codfw failover from db2144 to db2142 - T315853
[14:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:26] T315853: reclone x2 codfw hosts - https://phabricator.wikimedia.org/T315853
[14:24:41] (CR) Marostegui: [C: +2] mariadb: Promote db2142 to x2 master [puppet] - https://gerrit.wikimedia.org/r/825354 (https://phabricator.wikimedia.org/T315853) (owner: Marostegui)
[14:25:05] urandom: Can
I merge your puppet changes?
[14:26:58] marostegui: oh, does that require something other than a merge in gerrit?
[14:27:08] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:11] urandom: Yeah, it needs merging at puppetmaster1001 :)
[14:27:20] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:24] marostegui: haha, ok, yes
[14:27:35] urandom: I have merged it now (sudo -i puppet-merge)
[14:27:41] (PS1) Klausman: Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356
[14:27:44] marostegui: thank you
[14:27:50] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:54] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:28:20] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:29:28] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:05] SRE,
LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (thcipriani)
[14:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32750 and previous config saved to /var/cache/conftool/dbconfig/20220822-143040-root.json
[14:31:38] (PS2) Klausman: Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356
[14:32:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32751 and previous config saved to /var/cache/conftool/dbconfig/20220822-143212-root.json
[14:32:36] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T315853', diff saved to https://phabricator.wikimedia.org/P32752 and previous config saved to /var/cache/conftool/dbconfig/20220822-143243-root.json
[14:32:48] T315853: reclone x2 codfw hosts - https://phabricator.wikimedia.org/T315853
[14:33:04] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:34:10] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:56] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:35:46] RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:16] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:38:23] !log draining ganeti2019 for reimage T311686
[14:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:27] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[14:40:10] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:45:36] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:45:52] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:04] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:34] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:04] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:44] (PS1) Jbond: P:systemd::timesyncd: will need to remove the following fill after merge [puppet] - https://gerrit.wikimedia.org/r/825358
[14:48:58] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:49:32] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:49:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore x2 weight', diff saved to https://phabricator.wikimedia.org/P32754 and previous config saved to /var/cache/conftool/dbconfig/20220822-144937-marostegui.json
[14:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32755 and previous config saved to /var/cache/conftool/dbconfig/20220822-144943-root.json
[14:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 60%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32756 and previous config saved to /var/cache/conftool/dbconfig/20220822-144951-root.json
[14:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2144', diff saved to https://phabricator.wikimedia.org/P32757 and previous config saved to /var/cache/conftool/dbconfig/20220822-145040-marostegui.json
[14:51:22] RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:52] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:54:55] !log drain ulsfo-codfw circuit for Lumen hot cut - T300716
[14:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:18] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:55:52] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:58:06] (PS1) Hashar: puppet_compiler: relocate to /srv/jenkins [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698)
[14:58:21] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:01:34] (CR) JMeybohm: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[15:02:19] (CR) DCausse: "looks good, small nit about a comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[15:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32758 and previous config saved to /var/cache/conftool/dbconfig/20220822-150456-root.json
[15:05:47] (CR) Jbond: [C: +2] P:systemd::timesyncd: will need to remove the following fill after
merge [puppet] - https://gerrit.wikimedia.org/r/825358 (owner: Jbond)
[15:11:50] SRE, Infrastructure-Foundations, netops, SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (lmata)
[15:12:15] PROBLEM - puppet last run on search-loader1001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:12:48] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[15:14:22] (CR) Ssingh: [C: +1] trafficserver: Log cache read|write attempts [puppet] - https://gerrit.wikimedia.org/r/825350 (owner: Vgutierrez)
[15:14:28] (CR) Jbond: [C: +1] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:18:54] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36879/console" [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:19:58] (CR) Jbond: "LGTM, FYI im on vacation from wedensday so would be good to do this early tomorrow to make sure we fix any fall out" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:20:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32759 and previous config saved to /var/cache/conftool/dbconfig/20220822-152000-root.json
[15:20:38] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) All production elastic hosts are on Bullseye now. Closing...
[15:20:52] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) Open→Resolved
[15:20:56] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (bking)
[15:20:58] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (bking)
[15:21:00] SRE, Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (bking)
[15:21:35] (CR) Filippo Giunchedi: [V: +1 C: +2] wmflib: introduce pythonloglevel type [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:21:42] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: Jbond)
[15:22:06] SRE, Infrastructure-Foundations, observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (RobH)
[15:22:43] SRE, Infrastructure-Foundations, observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (RobH) Thanks for the update! This was raised as a concern when I handled of dumpsdata1007 for use in service, but noted it didn't yet have accurate raid...
[15:23:02] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (MoritzMuehlenhoff) Very nice!
[15:24:17] SRE-OnFire, Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata, and 3 others: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (Gehel) This does not seem to be related to Search / WDQS, so I'll untag the Search Platform team. Ping us aga...
[15:26:14] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Log cache read|write attempts [puppet] - 10https://gerrit.wikimedia.org/r/825350 (owner: 10Vgutierrez) [15:29:17] 10SRE, 10Infrastructure-Foundations, 10observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10ArielGlenn) I feel a bit queasy about having a server in production without the ability to monitor the raid; what do folks think about this? [15:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1530). [15:30:15] RECOVERY - puppet last run on search-loader1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:30:45] (03PS1) 10Vgutierrez: Revert "trafficserver: Log cache read|write attempts" [puppet] - 10https://gerrit.wikimedia.org/r/825278 [15:30:49] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:34:53] (03CR) 10CI reject: [V: 04-1] Revert "trafficserver: Log cache read|write attempts" [puppet] - 10https://gerrit.wikimedia.org/r/825278 (owner: 10Vgutierrez) [15:36:06] (03PS2) 10Vgutierrez: Revert "trafficserver: Log cache read|write attempts" [puppet] - 10https://gerrit.wikimedia.org/r/825278 [15:39:28] (03PS1) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 [15:40:07] (03PS1) 10Jbond: C:admin: add support for deprecated groups [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) [15:40:25] (03CR) 10Vgutierrez: [C: 03+2] Revert "trafficserver: Log cache read|write attempts" [puppet] - 10https://gerrit.wikimedia.org/r/825278 (owner: 
10Vgutierrez) [15:41:21] (03CR) 10CI reject: [V: 04-1] Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 (owner: 10Muehlenhoff) [15:42:24] (03CR) 10CI reject: [V: 04-1] C:admin: add support for deprecated groups [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: 10Jbond) [15:44:48] (03PS2) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 [15:45:38] (03PS2) 10Jbond: C:admin: add support for deprecated groups [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) [15:47:01] (03CR) 10CI reject: [V: 04-1] C:admin: add support for deprecated groups [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: 10Jbond) [15:52:43] !log un-drain ulsfo-codfw circuit for Lumen hot cut - T300716 [15:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:14] (03CR) 10Hnowlan: [C: 03+2] Add routing for Lift Wing inference models [deployment-charts] - 10https://gerrit.wikimedia.org/r/825356 (owner: 10Klausman) [15:56:40] (03PS3) 10Jbond: C:admin: add support for deprecated groups [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) [15:56:51] (03PS6) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [15:56:53] (03CR) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [15:56:56] (03CR) 10Hashar: puppet_compiler: relocate to /srv/jenkins (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: 10Hashar) [15:57:08] (03PS2) 10Hashar: puppet_compiler: relocate to /srv/jenkins [puppet] - 
10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) [15:58:27] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: 10Hashar) [15:58:35] (03Merged) 10jenkins-bot: Add routing for Lift Wing inference models [deployment-charts] - 10https://gerrit.wikimedia.org/r/825356 (owner: 10Klausman) [16:02:11] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [16:05:36] (03PS1) 10Jgreen: Add frdev-new-eqiad.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/825372 [16:10:28] (03PS1) 10Vgutierrez: trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - 10https://gerrit.wikimedia.org/r/825375 [16:11:58] (03CR) 10Ssingh: [C: 03+1] trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - 10https://gerrit.wikimedia.org/r/825375 (owner: 10Vgutierrez) [16:12:24] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36880/console" [puppet] - 10https://gerrit.wikimedia.org/r/825375 (owner: 10Vgutierrez) [16:12:33] (03CR) 10Jgreen: [C: 03+2] Add frdev-new-eqiad.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/825372 (owner: 10Jgreen) [16:15:50] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10hashar) This depends on whether we stick on `git-fat` (in which case we might need to do the porting, and even it is not immediately needed si... 
[16:16:09] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - 10https://gerrit.wikimedia.org/r/825375 (owner: 10Vgutierrez) [16:17:05] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:10] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10demon) a:03demon [16:17:39] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10MoritzMuehlenhoff) >>! In T279509#8174957, @hashar wrote: > This depends on whether we stick on `git-fat` (in which case we might need to do t... [16:18:59] (03CR) 10Hashar: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/1403/console failed cause there are no facts found for th" [puppet] - 10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: 10Hashar) [16:21:43] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10hashar) > Bullseye doesn't ship Python 2.7 in a supported version, it's only included to _build_ a few packages (e.g. qtwebkit). **Oops** my... 
[16:25:43] RECOVERY - Check systemd state on an-airflow1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:13] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:47:44] (03CR) 10Jbond: puppet_compiler: relocate to /srv/jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: 10Hashar) [16:51:22] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] scap: add permission mangling, reorder checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [16:52:08] (03PS5) 10Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) [16:52:52] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - 10https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [16:57:24] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10BTullis) [17:00:05] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1700). [17:03:20] (03CR) 10Dzahn: "@Dduvall Would that work for you? Could you follow Majavah's advice and add it on the local puppetmaster?" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:09:26] (03CR) 10Ottomata: [C: 03+1] "@aqu, I see that you responded to my comments in the latest patches. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [17:18:45] (03PS3) 10Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434) [17:18:51] (03PS3) 10Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) [17:22:36] (03PS1) 10Andrew Bogott: Keystone: expand the password safelist to specify restricted domains [puppet] - 10https://gerrit.wikimedia.org/r/825380 [17:27:39] (03CR) 10Majavah: [C: 03+1] "The idea looks good to me, and the implementation too but I don't really have a way to test it." [puppet] - 10https://gerrit.wikimedia.org/r/825380 (owner: 10Andrew Bogott) [17:28:09] (03CR) 10Dduvall: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:28:30] (03Abandoned) 10Dduvall: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:29:12] (03CR) 10Majavah: "Um, why was this abandoned?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:29:21] (03Restored) 10Dzahn: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:29:43] (03CR) 10Dzahn: "thank you, but in this case we should actually merge this, since it was the revert of adding it in labs/private" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:30:30] (03PS2) 10Andrew Bogott: Keystone: expand the password safelist to specify restricted domains [puppet] - 10https://gerrit.wikimedia.org/r/825380 [17:38:51] (03PS1) 10Jdlrobson: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) [17:47:19] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: expand the password safelist to specify restricted domains [puppet] - 10https://gerrit.wikimedia.org/r/825380 (owner: 10Andrew Bogott) [17:51:59] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [17:55:40] (03CR) 10Dzahn: "@Hokwelum regarding the prod-m3.sql.erb file. Yea, that needs coordination with DBA (in addition to editing the file in repo). 
There are e" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:59:33] (03CR) 10CI reject: [V: 04-1] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [18:00:33] (03PS1) 10Vgutierrez: trafficserver: Disable origin coalescing in cp601[56] [puppet] - 10https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) [18:02:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36881/console" [puppet] - 10https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [18:06:11] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:07] (03CR) 10BBlack: [C: 03+1] trafficserver: Disable origin coalescing in cp601[56] [puppet] - 10https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [18:08:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Disable origin coalescing in cp601[56] [puppet] - 10https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [18:12:02] !log disable origin coalescing in ats@cp601[56] - T315911 [18:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:06] T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 [18:14:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) Fixed in the following patches: 1. 
[[ https://gerrit.wikimedia.org/r/822196 | #822196 - netmon: Create... [18:14:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) 05Open→03Resolved [18:16:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul) @Joe can you please provide me with the partman recipe to use for those servers.The description says only Raid1 . thanks [18:16:34] (03PS3) 10Htriedman: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:17:10] (03CR) 10CI reject: [V: 04-1] Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:18:22] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-wf2001 [18:18:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-wf2001 [18:19:02] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-wf2002 [18:19:06] (03Abandoned) 10Andrew Bogott: keystone: add restrict_password_auth flag [puppet] - 10https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195) (owner: 10Andrew Bogott) [18:19:26] (03CR) 10Andrew Bogott: [C: 03+2] Openstack codfw1dev to version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [18:19:35] (03CR) 10Htriedman: "incorporated proposed edits from BBlack" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:19:36] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host mc-wf2002 [18:21:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:21:43] (03CR) 10Bernard Wang: "recheck" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [18:23:43] 10SRE, 10Observability-Logging, 10Observability-Metrics, 10Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10Krinkle) [18:25:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul) [18:26:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc-wf2001.mgmt.codfw.wmnet with reboot policy FORCED [18:30:23] (03PS1) 10Andrew Bogott: Openstack Trove: remove some file resources no longer needed in X [puppet] - 10https://gerrit.wikimedia.org/r/825394 (https://phabricator.wikimedia.org/T296561) [18:31:26] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Trove: remove some file resources no longer needed in X [puppet] - 10https://gerrit.wikimedia.org/r/825394 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [18:34:47] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:35:28] 10SRE, 10Data-Services, 10Discovery-Search, 10Wikidata, and 3 others: Do not rate limit dumps from 
internal network - https://phabricator.wikimedia.org/T222349 (10Gehel) a:05Gehel→03None [18:47:56] (03PS4) 10Andrea Denisse: netmon: Create LibreNMS logs file. [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) [18:50:55] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:53:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf2001.mgmt.codfw.wmnet with reboot policy FORCED [18:53:54] (03PS2) 10Dzahn: phabricator: add phab1004 to list of phab hosts for firewall [puppet] - 10https://gerrit.wikimedia.org/r/824802 (https://phabricator.wikimedia.org/T280597) [18:54:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc-wf2002.mgmt.codfw.wmnet with reboot policy FORCED [18:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:59:24] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@5ac442f]: Use instance specific HDFS cache on platform_eng [18:59:34] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@5ac442f]: Use instance specific HDFS cache on platform_eng (duration: 00m 10s) [19:00:05] (03PS1) 10Jgreen: Remove frlog1001 from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/825395 (https://phabricator.wikimedia.org/T312581) [19:00:21] (03CR) 10Dzahn: [C: 03+2] "allowing connections from/to phab1004 and other phab hosts" [puppet] - 10https://gerrit.wikimedia.org/r/824802 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:02:09] (03CR) 10Jgreen: [C: 03+2] Remove frlog1001 from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/825395 (https://phabricator.wikimedia.org/T312581) (owner: 10Jgreen) 
[19:02:59] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:15] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics_test@9edd1ab]: Use instance specific HDFS cache on analytics_test [19:04:21] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics_test@9edd1ab]: Use instance specific HDFS cache on analytics_test (duration: 00m 05s) [19:09:40] 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jgreen) [19:11:09] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics_test@5ac442f]: Use instance specific HDFS cache on analytics_test [19:11:27] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics_test@5ac442f]: Use instance specific HDFS cache on analytics_test (duration: 00m 17s) [19:12:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf2002.mgmt.codfw.wmnet with reboot policy FORCED [19:16:55] 10SRE, 10Wikimedia-GitHub: stop syncing and delete labs/private repo from github - https://phabricator.wikimedia.org/T315925 (10Dzahn) [19:20:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-wf2001'] [19:27:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-wf2001'] [19:27:21] (03CR) 10Bking: [C: 03+2] bullseye: apt component update [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [19:28:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-wf2002'] [19:30:12] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:18] 
RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-wf2002'] [19:46:02] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:49] (03PS1) 10Papaul: Add mc-wf200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/825398 (https://phabricator.wikimedia.org/T313966) [19:48:25] (03PS1) 10Andrew Bogott: Move cloudbackup100[12]-dev to Xena [puppet] - 10https://gerrit.wikimedia.org/r/825399 (https://phabricator.wikimedia.org/T296561) [19:49:19] (03CR) 10Papaul: [C: 03+2] Add mc-wf200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/825398 (https://phabricator.wikimedia.org/T313966) (owner: 10Papaul) [19:49:55] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudbackup100[12]-dev to Xena [puppet] - 10https://gerrit.wikimedia.org/r/825399 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [19:50:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2001.codfw.wmnet with OS bullseye [19:50:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc-wf2001.codfw.wmnet with OS bullseye [19:52:35] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui Chris is out on vacation I will take a look later to see. [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor I ♥ Unicode. All rise for UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T2000). [20:00:04] bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:46] * urbanecm waves [20:01:02] I can deploy today [20:01:21] bwang: hi, are you around? [20:01:34] yes! [20:01:46] Great! [20:02:17] bwang: your patch seems to fail CI. Why is that happening, please? [20:03:09] ah sorry, ill check now [20:03:38] thanks [20:03:54] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@5ac442f]: Use instance specific HDFS cache on analytics [20:03:56] (03CR) 10Bernard Wang: "recheck" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [20:04:07] hm odd, the errors dont seem related to the patch [20:04:34] ah, should've looked at the errors first. seems to be T315892, which is now fixed [20:04:34] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@5ac442f]: Use instance specific HDFS cache on analytics (duration: 00m 40s) [20:04:39] T315892: PHPUnit\Framework\Exception: This test uses TestCase::prophesize(), but phpspec/prophecy is not installed. - https://phabricator.wikimedia.org/T315892 [20:04:46] (03CR) 10Urbanecm: [C: 03+2] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [20:04:54] +2'ed and let's hope :) [20:07:22] (03CR) 10Cwhite: "Some suggestions inline to make the query a bit more efficient and exclude possibly changing hostnames." [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [20:08:21] sounds good! 
[20:09:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [20:13:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [20:13:50] fails again, but that's because the fix's not in wmf.25. meh. [20:14:26] (03PS1) 10Urbanecm: composer.json: Pin phpunit to 8.5.28 [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892) [20:14:40] (03CR) 10Urbanecm: [C: 03+2] "CI issues during backporting" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892) (owner: 10Urbanecm) [20:14:59] (03PS2) 10Urbanecm: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [20:15:05] (03CR) 10Urbanecm: [C: 03+2] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [20:15:17] bwang: trying again, this time with a proper depends-on. [20:15:31] 👍 [20:18:19] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10nskaggs) @Kelson Can you clarify how much additional space would be needed now? I saw the description... [20:20:39] (03CR) 10Andrea Denisse: "This patch sets up the correct directory to gather LibreNMS logs with logrotate." 
[puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [20:21:24] (03CR) 10Andrea Denisse: netmon: Configure Logrotate for LibreNMS logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [20:25:48] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36885/" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [20:27:45] (03CR) 10Dzahn: [C: 03+2] "It only affects vrts1001, i'll merge it." [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [20:28:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2001.codfw.wmnet with OS bullseye [20:28:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc-wf2001.codfw.wmnet with OS bullseye completed: -... 
[20:32:22] (03Merged) 10jenkins-bot: composer.json: Pin phpunit to 8.5.28 [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892) (owner: 10Urbanecm) [20:33:14] (03Merged) 10jenkins-bot: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson) [20:33:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2002.codfw.wmnet with OS bullseye [20:34:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc-wf2002.codfw.wmnet with OS bullseye [20:35:58] (03PS1) 10Ori: Incremental roll-out of query-sorting (5%) [puppet] - 10https://gerrit.wikimedia.org/r/825404 (https://phabricator.wikimedia.org/T314868) [20:36:33] (03PS2) 10Ori: Incremental roll-out of query-sorting (5%) [puppet] - 10https://gerrit.wikimedia.org/r/825404 (https://phabricator.wikimedia.org/T314868) [20:37:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:39:30] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10Jclark-ctr) replaced qsfp in e1 port 54 [20:44:25] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10cmooney) Thanks @Jclark-ctr that seems to have done it: ` 
cmooney@lsw1-e1-eqiad> show interfaces et-0/0/54 Aug 22 20:37:45 Physical interface: et-0/0... [20:44:52] bwang: sorry, missed the patches already merged. [20:45:08] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f3-eqiad down - https://phabricator.wikimedia.org/T315052 (10cmooney) 05Open→03Resolved [20:45:11] ok! where should i test it? [20:45:23] pulling to test srv now [20:45:57] bwang: pulled to mwdebug1001 now, please test it there :) [20:46:39] looks good! [20:47:19] thanks, syncing! [20:51:36] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.25/skins/Vector/: e0ff7634ac529acec6d298992b45b23203b682c1: Layout: Restore disabling of max width on certain pages (T315460) (duration: 03m 37s) [20:51:41] T315460: [Regression] Pages which are supposed to have full-width no longer have full-width layout - https://phabricator.wikimedia.org/T315460 [20:51:47] bwang: and should be live. thanks for your patience :). [20:52:05] thank you! [20:52:53] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (10fkaelin) Reviving this discussion, though I renamed the phab to "Running docker containers in a non-production environment", as the issue boils down to t... 
[20:53:18] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (10fkaelin) [20:53:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage [20:56:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage [20:58:52] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye [20:58:58] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye [20:59:09] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1195.eqiad.wmnet with OS bullseye [20:59:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye executed with er... 
[20:59:31] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [20:59:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye [20:59:54] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye [20:59:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with er... [21:00:05] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T2100). [21:00:49] ^ At least one going out soon for T310763... [21:01:37] Yes! I'm here for T310763! 
[21:01:55] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [21:02:01] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye [21:02:04] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye [21:02:09] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with er... [21:03:51] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:22] !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host db1185.eqiad.wmnet [21:06:40] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host db1185.eqiad.wmnet [21:11:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2002.codfw.wmnet with OS bullseye [21:11:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc-wf2002.codfw.wmnet with OS bullseye completed: - mc-wf2002 (**PASS**)... 
[21:13:39] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul) [21:14:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul) 05Open→03Resolved complete [21:17:13] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [21:17:19] T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604 [21:17:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [21:19:17] 10SRE, 10Projects-Cleanup, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10CCicalese_WMF) 05Open→03Resolved I believe this has been complete for some time. [21:26:35] !log Deployed security fix for T310763 [21:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:50] Confirming it's done by testing [21:30:26] 10SRE, 10Projects-Cleanup, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn) The ticket was open because not all checkboxes were checked yet. One is the removal of the tag in Phabricator. [21:30:30] Thanks, AnaisGuetyte! [21:30:44] Or AnaisGueyte, rather. 
[21:30:55] All good :) [21:31:36] (03PS1) 10Bking: apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604) [21:31:39] 10SRE, 10Projects-Cleanup, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn) I clicked "archive project" on https://phabricator.wikimedia.org/tag/fixcopyright.wikimedia.org/ [21:31:56] 10SRE, 10Projects-Cleanup, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn) [21:32:44] (03CR) 10Ryan Kemper: [C: 03+1] apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [21:35:07] (03CR) 10Dzahn: [C: 03+2] "File[/etc/cron.daily/spamassassin]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [21:41:39] (03CR) 10Bking: [C: 03+2] apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [21:45:56] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [21:46:01] T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604 [21:46:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [21:53:16] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [21:55:47] T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604 [21:56:16] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604 [22:02:28] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:46] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (10Dzahn) also see T273164 [22:09:59] (03PS1) 10Dzahn: spamassassin: fix spamassassin_updates script name in timer [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) [22:12:53] (03CR) 10Dzahn: [C: 03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/825416" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [22:13:26] (03CR) 10Dzahn: [C: 03+2] "Aug 22 21:33:55 otrs1001 systemd[21872]: spamassassin_updates.service: Failed at step EXEC spawning /usr/local/sbin/spamassassin_updates: " [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:15:17] (03CR) 10Dzahn: [C: 03+2] "ran manual rm /usr/local/sbin/spamassassin_timer.sh and systemctl start spamassassin_updates.service" [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:15:56] RECOVERY - 
Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:07] (03CR) 10Dzahn: [C: 03+2] "<+icinga-wm> RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:19:30] (03CR) 10Dzahn: [C: 03+1] netmon: Configure Logrotate for LibreNMS logs [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [22:20:51] (03CR) 10Dzahn: [C: 03+2] vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff) [22:20:58] (03PS2) 10Dzahn: vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff) [22:24:32] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:28] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:26] (03CR) 10Dzahn: "yep, thanks. 
this was noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff) [22:30:58] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:54] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:49:56] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:56:44] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:53] (03CR) 10Dzahn: "thanks for fixing it! I don't know how but it works now. probably because now the user phd was already created" [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [22:59:00] (03CR) 10Tim Starling: [C: 03+2] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling) [22:59:33] (03CR) 10Dzahn: "Thank you, I guess you could not reproduce because now the phd user was already created by user{}. I'll see what happens on phab1004." 
[puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [23:00:38] (03PS1) 10Zabe: Run the initsitestats period job on a daily basis [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) [23:02:58] (03CR) 10Dzahn: [C: 03+1] "thanks! I can confirm I ran this manually before and it did not take long and also it's a real issue that creates IRC pings" [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe) [23:04:25] !log Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [23:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:34] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/36886/" [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe) [23:08:29] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) The rollout was reverted back to stage 0 on August 15 due to T315271. I just reverted the revert, so it will be running on mediawiki.org once the... [23:10:45] !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(appservers|api)-ro [23:11:22] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [23:18:46] (03PS2) 10RLazarus: Run the initsitestats period job on a daily basis [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe) [23:19:22] (03CR) 10RLazarus: [C: 03+2] "Thanks Zabe! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe) [23:37:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Jclark-ctr when you are back on site can you please check the cable on db1185 looks like the cable is not connected. ` papaul@asw2-a-eqiad>... [23:39:51] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bullseye [23:39:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1187.eqiad.wmnet with OS bullseye [23:50:08] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:52:32] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [23:55:11] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage