[00:08:30] <wikibugs>	 (PS3) Tim Starling: Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427)
[00:09:38] <wikibugs>	 (CR) CI reject: [V: -1] SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[00:13:54] <wikibugs>	 (Merged) jenkins-bot: SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[00:18:48] <wikibugs>	 (CR) Tim Starling: Factor out x2 per-host hieradata into an objectstash role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[00:25:23] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix modtoken comparison T315271 (duration: 03m 45s)
[00:25:27] <stashbot>	 T315271: db1151, db2144 X2 masters error: Could not execute Delete_rows_v1 event on table mainstash.objectstash - https://phabricator.wikimedia.org/T315271
[00:26:16] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:28:26] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:30:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:34:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:34:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:38:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:00:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Designate codfw1dev to Xena [puppet] - https://gerrit.wikimedia.org/r/824885 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[01:05:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:05:54] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:34] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:15:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:12] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:16] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:22] <icinga-wm>	 RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1150) taken on 2022-08-22 01:14:01 (512 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:32] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:12] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:18] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:23:50] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:54] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:39:42] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:45:00] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:20] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:18] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:08:04] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:14] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:15:10] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:22] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:26:56] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:40] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:54] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:36:58] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:28] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:50] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:01:34] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:02:20] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:07:04] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:26:39] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (Ladsgroup) Migrating it to mailman3 would help if the volume is not too large. cc. @Ottomata
[05:27:15] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (Ladsgroup) In progress→Resolved a:cmooney Given that there is no answer, I close this. Please reopen if you can't access.
[05:29:31] <wikibugs>	 SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (Ladsgroup) p:Triage→Medium (don't mind me, SRE clinic duty)
[05:32:06] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:35:08] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:37:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (Ladsgroup) In progress→Resolved a:cmooney Since there hasn't been any response. I close it, reopen if you have trouble accessing.
[05:38:51] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Ladsgroup) In progress→Resolved a:odimitrijevic→cmooney Since there hasn't been any response. I close it, reopen if you have trouble acc...
[05:52:06] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2178 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825072 (https://phabricator.wikimedia.org/T311494)
[05:53:01] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2178 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825072 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:54:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2178 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32651 and previous config saved to /var/cache/conftool/dbconfig/20220822-055446-marostegui.json
[05:54:51] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[05:55:33] <wikibugs>	 (PS1) Marostegui: db2178: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825073 (https://phabricator.wikimedia.org/T311494)
[06:04:24] <wikibugs>	 (CR) Marostegui: [C: +2] db2178: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825073 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:09:10] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:10:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:10:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:11:18] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:32] <wikibugs>	 (PS1) Marostegui: db2179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825075 (https://phabricator.wikimedia.org/T311494)
[06:12:28] <wikibugs>	 (CR) Marostegui: [C: +2] db2179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825075 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:13:50] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825076 (https://phabricator.wikimedia.org/T311494)
[06:14:52] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825076 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:15:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:15:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:15:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:15:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:15:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2179 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32652 and previous config saved to /var/cache/conftool/dbconfig/20220822-061553-marostegui.json
[06:15:57] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[06:16:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32653 and previous config saved to /var/cache/conftool/dbconfig/20220822-061600-marostegui.json
[06:16:04] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:22:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32654 and previous config saved to /var/cache/conftool/dbconfig/20220822-062246-marostegui.json
[06:22:51] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:25:12] <wikibugs>	 (CR) Muehlenhoff: "Looks good, one typo inline" [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[06:27:40] <wikibugs>	 SRE, Infrastructure-Foundations, observability: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (MoritzMuehlenhoff) It's not broken, it's just not yet implemented :-) https://gerrit.wikimedia.org/r/c/operations/puppet/+/812250 is the main patch, but it f...
[06:28:04] <wikibugs>	 (CR) Marostegui: [C: +2] Factor out x2 per-host hieradata into an objectstash role [puppet] - https://gerrit.wikimedia.org/r/824579 (https://phabricator.wikimedia.org/T315427) (owner: Tim Starling)
[06:31:21] <wikibugs>	 (PS1) Marostegui: db2180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825230 (https://phabricator.wikimedia.org/T311494)
[06:32:09] <wikibugs>	 (CR) Marostegui: [C: +2] db2180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825230 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:33:20] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:33:43] <wikibugs>	 (PS1) Marostegui: db2180: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/825231 (https://phabricator.wikimedia.org/T311494)
[06:34:26] <wikibugs>	 (CR) Marostegui: [C: +2] db2180: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/825231 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[06:35:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2180 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32655 and previous config saved to /var/cache/conftool/dbconfig/20220822-063533-marostegui.json
[06:35:38] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[06:37:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32656 and previous config saved to /var/cache/conftool/dbconfig/20220822-063752-marostegui.json
[06:38:22] <marostegui>	 !log Install 10.4.26 on db1119, db1142, db1096 T315411
[06:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:26] <stashbot>	 T315411: Compile and package MariaDB 10.6.9 and 10.4.26 - https://phabricator.wikimedia.org/T315411
[06:38:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 db1142 db1096', diff saved to https://phabricator.wikimedia.org/P32657 and previous config saved to /var/cache/conftool/dbconfig/20220822-063857-root.json
[06:39:43] <wikibugs>	 (PS7) Ori: Incremental roll-out of query-sorting (0%) [puppet] - https://gerrit.wikimedia.org/r/822434 (https://phabricator.wikimedia.org/T314868)
[06:39:45] <wikibugs>	 (PS1) Ori: Incremental roll-out of query-sorting (1%) [puppet] - https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868)
[06:43:58] <ori>	 Amir1, urbanecm: I have a patch for this upcoming window, but it's the seventh and the max is six. I'm also going to be a few minutes late, need to get to a different location. If you're up for it, ping me when you're done with the other patches in the window, but if not that's OK also.
[06:44:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32658 and previous config saved to /var/cache/conftool/dbconfig/20220822-064418-root.json
[06:44:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32659 and previous config saved to /var/cache/conftool/dbconfig/20220822-064424-root.json
[06:44:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32660 and previous config saved to /var/cache/conftool/dbconfig/20220822-064448-root.json
[06:44:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32661 and previous config saved to /var/cache/conftool/dbconfig/20220822-064457-root.json
[06:45:54] <Amir1>	 ori: I suggest letting the window finish and then let's do it either together or you self-serve
[06:46:06] <Amir1>	 (basically what you said :D)
[06:46:26] <ori>	 ack, sg.
[06:48:55] <wikibugs>	 (PS1) Marostegui: db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540)
[06:52:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32662 and previous config saved to /var/cache/conftool/dbconfig/20220822-065258-marostegui.json
[06:54:30] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:59:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32663 and previous config saved to /var/cache/conftool/dbconfig/20220822-065923-root.json
[06:59:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P32664 and previous config saved to /var/cache/conftool/dbconfig/20220822-065929-root.json
[06:59:42] <taavi>	 ori: I think you're looking at the calendar from a week ago, the upcoming window is empty
[06:59:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32665 and previous config saved to /var/cache/conftool/dbconfig/20220822-065953-root.json
[07:00:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32666 and previous config saved to /var/cache/conftool/dbconfig/20220822-070001-root.json
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T0700). Please do the needful.
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:20] <Amir1>	 haha
[07:00:25] <Amir1>	 ori: have fun
[07:01:42] <wikibugs>	 (CR) Ladsgroup: [C: +1] db-production: Set es4 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: Marostegui)
[07:05:17] <wikibugs>	 (PS1) Marostegui: db2181: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825235 (https://phabricator.wikimedia.org/T311494)
[07:07:00] <wikibugs>	 (CR) Marostegui: [C: +2] db2181: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/825235 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:08:01] <wikibugs>	 (CR) Muehlenhoff: P:systemd::timesyncd: allow overriding the protectsystem systemd param (3 comments) [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[07:08:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312972)', diff saved to https://phabricator.wikimedia.org/P32667 and previous config saved to /var/cache/conftool/dbconfig/20220822-070804-marostegui.json
[07:08:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:08:10] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:08:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:08:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 7 hosts with reason: Maintenance
[07:08:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 7 hosts with reason: Maintenance
[07:09:00] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825236 (https://phabricator.wikimedia.org/T311494)
[07:10:24] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2181 to dbctl [puppet] - https://gerrit.wikimedia.org/r/825236 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[07:11:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2181 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32668 and previous config saved to /var/cache/conftool/dbconfig/20220822-071153-marostegui.json
[07:11:58] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[07:14:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32669 and previous config saved to /var/cache/conftool/dbconfig/20220822-071427-root.json
[07:14:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32670 and previous config saved to /var/cache/conftool/dbconfig/20220822-071433-root.json
[07:14:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32671 and previous config saved to /var/cache/conftool/dbconfig/20220822-071458-root.json
[07:15:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32672 and previous config saved to /var/cache/conftool/dbconfig/20220822-071506-root.json
[07:20:25] <wikibugs>	 (PS1) Muehlenhoff: Remove access for dpifke [puppet] - https://gerrit.wikimedia.org/r/825238
[07:23:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:23:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:23:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32673 and previous config saved to /var/cache/conftool/dbconfig/20220822-072339-marostegui.json
[07:23:43] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[07:24:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for dpifke [puppet] - https://gerrit.wikimedia.org/r/825238 (owner: Muehlenhoff)
[07:26:09] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:29:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32674 and previous config saved to /var/cache/conftool/dbconfig/20220822-072932-root.json
[07:29:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32675 and previous config saved to /var/cache/conftool/dbconfig/20220822-072938-root.json
[07:30:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32676 and previous config saved to /var/cache/conftool/dbconfig/20220822-073002-root.json
[07:30:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32677 and previous config saved to /var/cache/conftool/dbconfig/20220822-073010-root.json
[07:32:16] <hashar>	 ori: Amir1: the limit is merely an arbitrary suggestion. I guess at some point we found out 6 patches would fit in a one hour window
[07:33:02] <hashar>	 similar to the no deploy fridays ;D
[07:38:38] <wikibugs>	 SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) {P32679}
[07:39:29] <wikibugs>	 SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) {F35483565}
[07:44:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32681 and previous config saved to /var/cache/conftool/dbconfig/20220822-074437-root.json
[07:44:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P32682 and previous config saved to /var/cache/conftool/dbconfig/20220822-074443-root.json
[07:45:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32683 and previous config saved to /var/cache/conftool/dbconfig/20220822-074507-root.json
[07:45:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32684 and previous config saved to /var/cache/conftool/dbconfig/20220822-074515-root.json
[07:45:55] <urbanecm>	 ori: personally, I don’t have an issue with going over six patches, as long as there is time for everything.
[07:47:12] <ori>	 Thanks 
[07:49:21] <wikibugs>	 (03PS1) 10Marostegui: db2182: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/825243 (https://phabricator.wikimedia.org/T311494)
[07:50:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2182: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/825243 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[07:51:38] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2182 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/825244 (https://phabricator.wikimedia.org/T311494)
[07:52:27] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2182 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/825244 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[07:54:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2182 to dbctl T311494', diff saved to https://phabricator.wikimedia.org/P32685 and previous config saved to /var/cache/conftool/dbconfig/20220822-075359-marostegui.json
[07:54:04] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[07:54:37] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Marostegui)
[07:59:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32686 and previous config saved to /var/cache/conftool/dbconfig/20220822-075941-root.json
[07:59:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32687 and previous config saved to /var/cache/conftool/dbconfig/20220822-075949-root.json
[08:00:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32688 and previous config saved to /var/cache/conftool/dbconfig/20220822-080012-root.json
[08:00:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32689 and previous config saved to /var/cache/conftool/dbconfig/20220822-080020-root.json
[08:04:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32690 and previous config saved to /var/cache/conftool/dbconfig/20220822-080424-marostegui.json
[08:04:29] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:05:37] <wikibugs>	 (03CR) 10Wangombe: [C: 03+1] TranslatableBundleLogFormatter: Cast reason to string before passing it [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) (owner: 10Jforrester)
[08:05:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-production: Set es4 as RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:06:24] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "Needs a rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:06:35] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "Thanks for the cleanups!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE))
[08:07:00] <wikibugs>	 (03Merged) 10jenkins-bot: db-production: Set es4 as RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825234 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:07:14] <wikibugs>	 (03PS4) 10Ladsgroup: Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:08:12] <icinga-wm>	 PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:08:22] <icinga-wm>	 PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:43] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736)
[08:10:45] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736)
[08:10:47] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736)
[08:10:49] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736)
[08:10:51] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736)
[08:10:53] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736)
[08:10:55] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736)
[08:10:57] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736)
[08:11:17] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es4 T315540 (duration: 03m 35s)
[08:11:22] <stashbot>	 T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540
[08:12:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add variables regulating the php 7.4 transition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:12:57] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540)
[08:13:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:13:54] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540)
[08:14:32] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266
[08:14:36] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:14:39] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui)
[08:14:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32691 and previous config saved to /var/cache/conftool/dbconfig/20220822-081453-root.json
[08:14:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for investigating this!" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[08:15:03] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:15:05] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:15:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui)
[08:15:32] <hashar>	 _joe_: should I deploy your patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823674/ ?
[08:15:36] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes202[34] implementation tracking - https://phabricator.wikimedia.org/T313871 (10JMeybohm)
[08:15:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite)
[08:15:52] <_joe_>	 hashar: I can deploy it myself, just need a +1
[08:15:56] <hashar>	 done
[08:16:10] <_joe_>	 <3
[08:16:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Prod-Kubernetes, 10serviceops: kubernetes202[34] implementation tracking - https://phabricator.wikimedia.org/T313871 (10JMeybohm)
[08:16:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:17:55] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es4 T315540
[08:17:59] <stashbot>	 T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540
[08:18:01] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es4 T315540
[08:18:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:18:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:18:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1021 with weight 10 T315540', diff saved to https://phabricator.wikimedia.org/P32692 and previous config saved to /var/cache/conftool/dbconfig/20220822-081817-root.json
[08:19:03] <wikibugs>	 (03Merged) 10jenkins-bot: Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:19:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P32693 and previous config saved to /var/cache/conftool/dbconfig/20220822-081930-marostegui.json
[08:19:50] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:20:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/825245 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:21:19] <marostegui>	 !log Starting es4 eqiad failover from es1020 to es1021 - T315540
[08:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1021 to es4 primary T315540', diff saved to https://phabricator.wikimedia.org/P32694 and previous config saved to /var/cache/conftool/dbconfig/20220822-082208-root.json
[08:23:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Update es-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/825247 (https://phabricator.wikimedia.org/T315540) (owner: 10Marostegui)
[08:23:28] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:58] <icinga-wm>	 PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:25:12] <wikibugs>	 (03CR) 10Marostegui: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui)
[08:25:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui)
[08:25:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: WIP dispatch: add database role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[08:25:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:26:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-production: Set es4 as RO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825266 (owner: 10Marostegui)
[08:26:30] <dcausse>	 sigh... the 3 new wdqs nodes wdqs1014, wdqs1015 and wdqs1016 seem to be falling apart, can't connect to them, is there anyone available to have a quick look at them?
[08:29:00] <logmsgbot>	 !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Introducing variables for php 7.4 migration (duration: 03m 39s)
[08:29:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32695 and previous config saved to /var/cache/conftool/dbconfig/20220822-082958-root.json
[08:30:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:30:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:31:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:31:34] <moritzm>	 dcausse: I can't even log in via the serial console, on wdqs1014 I can only see a wdqs-categories process spewing 100s of lines of error messages per second. Shall I just powercycle?
[08:31:55] <moritzm>	 (I get a serial console, but no tty login is possible)
[08:32:18] <dcausse>	 moritzm: thanks for looking! yes please :)
[08:32:23] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es4 T315540 (duration: 03m 17s)
[08:32:27] <stashbot>	 T315540: switchover es4 master es1020 -> es1021 - https://phabricator.wikimedia.org/T315540
[08:32:48] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:33:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 for reboot T310485', diff saved to https://phabricator.wikimedia.org/P32696 and previous config saved to /var/cache/conftool/dbconfig/20220822-083341-root.json
[08:33:52] <moritzm>	 !log powercycling wdqs1014 (unresponsive via botched wdqs-categories process)
[08:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P32697 and previous config saved to /var/cache/conftool/dbconfig/20220822-083436-marostegui.json
[08:34:58] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:35:10] <icinga-wm>	 PROBLEM - Host wdqs1014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:36:36] <icinga-wm>	 RECOVERY - Host wdqs1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[08:37:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:37:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:37:22] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:30] <icinga-wm>	 RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:38:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:40:01] <moritzm>	 dcausse: same issue on wdqs1015, also going to powercycle it
[08:40:54] <moritzm>	 although, actually I can log in just fine (although the console keeps getting spammed in the same manner as 1014)
[08:41:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Ladsgroup) there is no checklist here, is there anything left before closing this ticket?
[08:42:36] <moritzm>	 dcausse: same thing for 1016
[08:42:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert No, we're done I think. Closing.
[08:43:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32698 and previous config saved to /var/cache/conftool/dbconfig/20220822-084335-root.json
[08:43:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 ', diff saved to https://phabricator.wikimedia.org/P32699 and previous config saved to /var/cache/conftool/dbconfig/20220822-084359-root.json
[08:46:58] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:06] <dcausse>	 moritzm: thanks!
[08:48:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32700 and previous config saved to /var/cache/conftool/dbconfig/20220822-084800-root.json
[08:49:24] <icinga-wm>	 RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312972)', diff saved to https://phabricator.wikimedia.org/P32701 and previous config saved to /var/cache/conftool/dbconfig/20220822-084942-marostegui.json
[08:49:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[08:49:47] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:50:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[08:50:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32702 and previous config saved to /var/cache/conftool/dbconfig/20220822-085014-marostegui.json
[08:50:32] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Acknowledged: T315850 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:54:16] <icinga-wm>	 RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:46] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Feel free to ignore the naming nit. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[08:56:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32703 and previous config saved to /var/cache/conftool/dbconfig/20220822-085654-marostegui.json
[08:57:01] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[08:57:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove now redundant group [puppet] - 10https://gerrit.wikimedia.org/r/825250
[08:59:11] <wikibugs>	 (03PS2) 10Btullis: Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196)
[09:02:59] <wikibugs>	 (03PS4) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196)
[09:03:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32704 and previous config saved to /var/cache/conftool/dbconfig/20220822-090305-root.json
[09:10:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) Any ETA for getting db1187 and db1185 online?
[09:11:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui)
[09:11:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/824776 (owner: 10Eevans)
[09:12:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P32705 and previous config saved to /var/cache/conftool/dbconfig/20220822-091200-marostegui.json
[09:12:09] <wikibugs>	 (03PS2) 10Tim Starling: Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271)
[09:13:44] <wikibugs>	 (03PS4) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229)
[09:13:46] <wikibugs>	 (03PS4) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[09:13:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wmflib: introduce pythonloglevel type [puppet] - 10https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229)
[09:14:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: WIP: add profile::dispatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[09:17:40] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825256 (https://phabricator.wikimedia.org/T315856)
[09:18:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32706 and previous config saved to /var/cache/conftool/dbconfig/20220822-091810-root.json
[09:18:24] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add a new signing profile for the dse_k8s cfssl-issuer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:19:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add a new signing profile for the dse_k8s cfssl-issuer [puppet] - 10https://gerrit.wikimedia.org/r/824723 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:20:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (+1ing my own patch due to followup)" [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff)
[09:22:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:24:10] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:24:26] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825256 (https://phabricator.wikimedia.org/T315856) (owner: 10Marostegui)
[09:24:42] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] "No reservations from us - thanks for this cleanup!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE))
[09:24:46] <marostegui>	 dbproxy alerts are expected
[09:25:43] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:25:54] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:27:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P32708 and previous config saved to /var/cache/conftool/dbconfig/20220822-092706-marostegui.json
[09:27:17] <wikibugs>	 (03PS2) 10Ori: Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868)
[09:27:23] <ori>	 any objections to me deploying a config patch? 
[09:27:55] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db218* [puppet] - 10https://gerrit.wikimedia.org/r/825259
[09:28:27] <wikibugs>	 (03PS2) 10Jbond: bullseye: apt component update [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[09:28:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[09:28:57] <wikibugs>	 (03PS2) 10Gergő Tisza: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno)
[09:29:02] <wikibugs>	 (03Merged) 10jenkins-bot: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:29:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db218* [puppet] - 10https://gerrit.wikimedia.org/r/825259 (owner: 10Marostegui)
[09:30:15] <wikibugs>	 (03PS2) 10Btullis: Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - 10https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196)
[09:30:37] <ori>	 jouncebot: nowandnext
[09:30:37] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 29 minute(s)
[09:30:38] <jouncebot>	 In 3 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300)
[09:32:40] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[09:33:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 8%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32709 and previous config saved to /var/cache/conftool/dbconfig/20220822-093314-root.json
[09:34:17] * ori goes for it
[09:34:23] <wikibugs>	 (03CR) 10Ori: [C: 03+2] Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[09:35:09] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[09:36:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[09:38:03] <XioNoX>	 !log push new policy on pfw3-eqiad - T315578
[09:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:39:44] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[09:41:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[09:41:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:41:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:42:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312972)', diff saved to https://phabricator.wikimedia.org/P32710 and previous config saved to /var/cache/conftool/dbconfig/20220822-094213-marostegui.json
[09:42:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:42:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:42:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T312972)', diff saved to https://phabricator.wikimedia.org/P32711 and previous config saved to /var/cache/conftool/dbconfig/20220822-094234-marostegui.json
[09:42:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:43:22] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:44:37] <wikibugs_>	 (03PS5) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663)
[09:44:58] <logmsgbot>	 !log ori@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I5ea1b1286: Set $wgCdnMatchParameterOrder to false by default (T314868) (duration: 03m 31s)
[09:45:26] <wikibugs_>	 (03CR) 10Marostegui: [C: 03+1] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling)
[09:48:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32712 and previous config saved to /var/cache/conftool/dbconfig/20220822-094819-root.json
[09:48:20] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet
[09:48:45] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet
[09:51:11] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[09:52:15] <wikibugs>	 (03CR) 10Ayounsi: Bump pynetbox to ~= 6.6 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[09:52:25] <wikibugs>	 (03PS1) 10Jbond: C:scap: use in clude scap vs require [puppet] - 10https://gerrit.wikimedia.org/r/825262
[09:52:56] <wikibugs>	 (03PS3) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847)
[09:52:58] <wikibugs>	 (03PS3) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847)
[09:53:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[09:55:21] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori)
[09:57:09] <wikibugs>	 (03PS2) 10Vgutierrez: Incremental roll-out of query-sorting (1%) [puppet] - 10https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[09:58:08] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet
[09:58:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:59:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:scap: use in clude scap vs require [puppet] - 10https://gerrit.wikimedia.org/r/825262 (owner: 10Jbond)
[09:59:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (1%) [puppet] - 10https://gerrit.wikimedia.org/r/825232 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[09:59:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: drop managehome parameter as it dosn;t work (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[10:00:17] <vgutierrez>	 jbond: go ahead if ori's change is in your puppet-merge session :)
[10:00:32] <vgutierrez>	 hmm nevermind
[10:00:40] <vgutierrez>	 (merged)
[10:00:53] <vgutierrez>	 !log Incremental roll-out of query-sorting (1%) - T314868
[10:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:58] <stashbot>	 T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868
[10:01:00] <vgutierrez>	 ori: ^^
[10:02:02] <ori>	 woot
[10:03:13] <vgutierrez>	 what's your t-shirt size ori? just in case ;P
[10:03:18] <ori>	 haha
[10:03:20] <ori>	 L
[10:03:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 20%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32714 and previous config saved to /var/cache/conftool/dbconfig/20220822-100324-root.json
[10:03:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:04:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) I see that the subtask got resolved, nice!  Please run the new/additional cables without connecting them. Once done...
[10:05:23] <wikibugs>	 (03PS1) 10Jbond: hieradata: enable systemd user on phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/825289 (https://phabricator.wikimedia.org/T315568)
[10:05:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] hieradata: enable systemd user on phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/825289 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[10:07:31] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10jbond) i have re-enabled systemd::sysuser on phab2002 and things seem to be working, let me know if there is still an issue
[10:08:14] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Set transaction_active_timeout_out on cp4026 and cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez)
[10:10:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/825250 (owner: 10Muehlenhoff)
[10:11:24] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Remove insetup from db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825291 (https://phabricator.wikimedia.org/T313569)
[10:18:21] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[10:18:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 30%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32715 and previous config saved to /var/cache/conftool/dbconfig/20220822-101828-root.json
[10:19:18] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:21:00] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:48] <wikibugs>	 (03CR) 10Jbond: "see comment on naming.  We can also replace the following" [puppet] - 10https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[10:25:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:28:08] <wikibugs>	 (03PS1) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866)
[10:29:36] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add the necessary configuration to enable the dse-k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/824694 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis)
[10:30:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10JayCano) Hi @cmooney. I can confirm that Tšepo requires this level of access for some work that we are going to do. Thank you.
[10:30:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup)
[10:33:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 40%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32716 and previous config saved to /var/cache/conftool/dbconfig/20220822-103333-root.json
[10:34:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10MoritzMuehlenhoff)
[10:35:11] <wikibugs>	 (03CR) 10Majavah: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn)
[10:35:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10MoritzMuehlenhoff)
[10:35:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (10MoritzMuehlenhoff)
[10:36:08] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1016 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312972)', diff saved to https://phabricator.wikimedia.org/P32717 and previous config saved to /var/cache/conftool/dbconfig/20220822-104249-marostegui.json
[10:42:54] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[10:43:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove now redundant group [puppet] - 10https://gerrit.wikimedia.org/r/825250 (owner: 10Muehlenhoff)
[10:47:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:48:10] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32718 and previous config saved to /var/cache/conftool/dbconfig/20220822-104838-root.json
[10:49:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:52:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:53:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:54:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P32719 and previous config saved to /var/cache/conftool/dbconfig/20220822-105755-marostegui.json
[11:00:19] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:01:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] O:prometheus: use map instead of reduce (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[11:01:19] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:03:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] O:prometheus: use map instead of reduce (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[11:03:23] <jayme>	 kubernetes-eqiad BGP errors are me (should be temporary)
[11:03:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 60%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32720 and previous config saved to /var/cache/conftool/dbconfig/20220822-110342-root.json
[11:04:17] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:10:39] <wikibugs>	 (03PS9) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:12:17] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:13:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P32721 and previous config saved to /var/cache/conftool/dbconfig/20220822-111301-marostegui.json
[11:14:54] <wikibugs>	 (03PS1) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435)
[11:16:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup)
[11:16:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet
[11:17:57] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:25] <wikibugs>	 (03PS2) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866)
[11:18:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32722 and previous config saved to /var/cache/conftool/dbconfig/20220822-111847-root.json
[11:20:23] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:25] <wikibugs>	 (03PS10) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:22:11] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:22:24] <wikibugs>	 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Aklapper) Hi, please follow the docs at https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process which links to a Phabricator template - thanks!
[11:23:30] <wikibugs>	 (03PS2) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435)
[11:24:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:24:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup)
[11:24:58] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add etcd data for dse-k8s kubeserver-api backend selection. [puppet] - 10https://gerrit.wikimedia.org/r/824705 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis)
[11:25:24] <wikibugs>	 (03PS11) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:25:47] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dse-k8s-ctrl1001.eqiad.wmnet
[11:27:20] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36867/console" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:27:47] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312972)', diff saved to https://phabricator.wikimedia.org/P32723 and previous config saved to /var/cache/conftool/dbconfig/20220822-112808-marostegui.json
[11:28:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[11:28:12] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[11:28:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[11:28:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32724 and previous config saved to /var/cache/conftool/dbconfig/20220822-112829-marostegui.json
[11:31:51] <wikibugs>	 (03PS3) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435)
[11:32:39] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:32:42] <wikibugs>	 (03PS12) 10Slyngshede: define osm::planet_sync move from cron to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673)
[11:32:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:33:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[11:33:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32725 and previous config saved to /var/cache/conftool/dbconfig/20220822-113352-root.json
[11:33:56] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36868/console" [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:34:17] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:53] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:57] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:35:15] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:36:17] <moritzm>	 !log installing libdatetime-timezone-perl updates from SUA update
[11:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:35] <wikibugs>	 (03PS3) 10Jbond: C:admin: when creating users make sure we add a dependency on the shell package [puppet] - 10https://gerrit.wikimedia.org/r/824164
[11:36:41] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:38:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for aline_bruenger_WMDE - https://phabricator.wikimedia.org/T315865 (10Aline_Bruenger_WMDE)
[11:38:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10Ladsgroup) a:05cmooney→03Ladsgroup Taking over as I'm on clinic duty this week.  This also needs approval from @Ottomata or @odimitrijevi...
[11:38:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825291 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui)
[11:38:45] <icinga-wm>	 RECOVERY - puppet last run on netboxdb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:39:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Aline_Bruenger_WMDE) @Aklapper , I edited my initial request according to the template.
[11:39:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup)
[11:40:27] <wikibugs>	 (03PS1) 10Marostegui: db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542)
[11:41:53] <wikibugs>	 (03PS1) 10Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[11:42:03] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542)
[11:42:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) While I check the notes in the checklist, this needs approval from your manager (Lea?) and analytics approval (@odimitrijevic or @Ottomata)
[11:43:08] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542)
[11:44:19] <wikibugs>	 (03PS1) 10Jbond: C:prometheus::ipmi_exporter: only listen on primary address [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834)
[11:44:46] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "Fixed two comments from Daniel." [puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:45:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36869/console" [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834) (owner: 10Jbond)
[11:45:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10Ladsgroup)
[11:46:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[11:46:14] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[11:46:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[11:46:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[11:47:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es5 T315542
[11:47:27] <stashbot>	 T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542
[11:47:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es5 T315542
[11:47:47] <wikibugs>	 (03Merged) 10jenkins-bot: db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825328 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[11:51:13] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es5 T315542 (duration: 03m 08s)
[11:53:08] <wikibugs>	 (03PS1) 10Muehlenhoff: vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333
[11:54:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:56:36] <jayme>	 could someone with op temporarily remove the "SREs on call ..." part from the channel topic please? It's currently not kept up to date automatically
[11:57:28] <wikibugs>	 (03PS1) 10Stang: trwikiquote: Enable block feature of abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825334 (https://phabricator.wikimedia.org/T315736)
[11:58:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:58:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:00:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:01:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1024 with weight 10 T315542', diff saved to https://phabricator.wikimedia.org/P32726 and previous config saved to /var/cache/conftool/dbconfig/20220822-120141-root.json
[12:01:45] <stashbot>	 T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542
[12:02:21] <icinga-wm>	 PROBLEM - puppet last run on registry1004 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:02:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:prometheus::ipmi_exporter: only listen on primary address [puppet] - 10https://gerrit.wikimedia.org/r/825332 (https://phabricator.wikimedia.org/T315834) (owner: 10Jbond)
[12:04:43] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/825330 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[12:05:33] <marostegui>	 !log Starting es5 eqiad failover from es1023 to es1024 - T315542
[12:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1024 to es5 primary T315542', diff saved to https://phabricator.wikimedia.org/P32727 and previous config saved to /var/cache/conftool/dbconfig/20220822-120611-root.json
[12:06:49] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274
[12:07:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Update es5-master [dns] - 10https://gerrit.wikimedia.org/r/825331 (https://phabricator.wikimedia.org/T315542) (owner: 10Marostegui)
[12:07:07] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825274 (owner: 10Marostegui)
[12:07:31] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2022-08-22-093815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/825336 (https://phabricator.wikimedia.org/T308248)
[12:08:41] <icinga-wm>	 RECOVERY - puppet last run on registry1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:09:04] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/825274 (owner: Marostegui)
[12:09:53] <wikibugs>	 (Merged) jenkins-bot: Revert "db-production.php: Disable writes on es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/825274 (owner: Marostegui)
[12:11:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (Ladsgroup) p:Triage→Medium
[12:12:21] <wikibugs>	 (PS2) Jbond: P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643)
[12:13:23] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es5 T315542 (duration: 03m 18s)
[12:13:28] <stashbot>	 T315542: switchover es5 master es1023 -> es1024 - https://phabricator.wikimedia.org/T315542
[12:14:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1023 for reboot T315542', diff saved to https://phabricator.wikimedia.org/P32728 and previous config saved to /var/cache/conftool/dbconfig/20220822-121401-root.json
[12:15:28] <wikibugs>	 (PS3) Jbond: P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643)
[12:16:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:16:23] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36871/console" [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[12:16:25] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:16:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:16:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:17:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:20:06] <moritzm>	 !log fix up network config for ldap-replica2006 T273026
[12:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:10] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[12:20:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2006.wikimedia.org
[12:20:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (Siko_WMDE)
[12:20:56] <jayme>	 !log kubernetes1016:~$ sudo systemctl reset-failed ifup@ens13.service - T273026
[12:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:15] <icinga-wm>	 RECOVERY - Host ldap-replica2006 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[12:21:57] <wikibugs>	 (CR) Jbond: C:admin: when creating users make sure we add a dependency on the shell package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[12:22:05] <wikibugs>	 (CR) Jbond: [C: +2] C:admin: when creating users make sure we add a dependency on the shell package [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[12:22:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32729 and previous config saved to /var/cache/conftool/dbconfig/20220822-122214-marostegui.json
[12:22:19] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[12:26:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2006.wikimedia.org
[12:28:47] <wikibugs>	 (PS4) Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673)
[12:30:05] <wikibugs>	 (PS5) Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673)
[12:31:27] <wikibugs>	 (CR) JMeybohm: [C: -1] "PTR is missing" [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[12:31:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32730 and previous config saved to /var/cache/conftool/dbconfig/20220822-123135-root.json
[12:31:45] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36872/console" [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[12:33:11] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet
[12:36:18] <wikibugs>	 (CR) Slyngshede: [V: +1] c:spamassassin move Spamassassin updates from crontab to systemd timers. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[12:37:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P32731 and previous config saved to /var/cache/conftool/dbconfig/20220822-123720-marostegui.json
[12:39:37] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet
[12:45:19] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet
[12:46:08] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:46:18] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:46:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32732 and previous config saved to /var/cache/conftool/dbconfig/20220822-124640-root.json
[12:48:04] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:48:57] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[12:49:42] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 138, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:50:57] <wikibugs>	 (PS1) Ladsgroup: SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693)
[12:51:45] <Amir1>	 jouncebot: nowandnext
[12:51:45] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 8 minute(s)
[12:51:45] <jouncebot>	 In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300)
[12:52:02] <Amir1>	 it's empty, +2ing my patch
[12:52:13] <wikibugs>	 (CR) Ladsgroup: [C: +2] SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693) (owner: Ladsgroup)
[12:52:21] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet
[12:52:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P32734 and previous config saved to /var/cache/conftool/dbconfig/20220822-125226-marostegui.json
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:18] * urbanecm waves
[13:01:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32735 and previous config saved to /var/cache/conftool/dbconfig/20220822-130144-root.json
[13:02:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (Ladsgroup) p:Triage→Medium a:Ladsgroup
[13:03:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (Ladsgroup) a:Ladsgroup
[13:03:19] <jynus>	 !log disabled backup scheduling for backup1002, backup2002 T315864
[13:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:23] <stashbot>	 T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864
[13:04:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (WMDE-leszek) I approve this request on WMDE's behalf.
[13:04:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (WMDE-leszek) I approve this request on WMDE's behalf.
[13:05:07] <wikibugs>	 (PS2) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:07:09] <wikibugs>	 (PS3) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312972)', diff saved to https://phabricator.wikimedia.org/P32737 and previous config saved to /var/cache/conftool/dbconfig/20220822-130732-marostegui.json
[13:07:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[13:07:37] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[13:07:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[13:08:23] <wikibugs>	 (Merged) jenkins-bot: SiteStats: Make sure initSiteStats.php re-distribute values [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825276 (https://phabricator.wikimedia.org/T315693) (owner: Ladsgroup)
[13:09:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:09:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:12:02] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36873/console" [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[13:13:11] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/includes: Backport: [[gerrit:825276|SiteStats: Make sure initSiteStats.php re-distribute values (T315693)]] (duration: 03m 32s)
[13:13:15] <stashbot>	 T315693: Inflated counts in site statistics - https://phabricator.wikimedia.org/T315693
[13:13:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:14:44] <wikibugs>	 (PS4) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[13:14:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:14:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:14:49] <wikibugs>	 (CR) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[13:15:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:16:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 8%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32738 and previous config saved to /var/cache/conftool/dbconfig/20220822-131649-root.json
[13:17:50] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[13:20:30] <wikibugs>	 (PS1) Clément Goubert: vopsbot: join #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/825346
[13:20:41] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Ottomata) > running monthly data dump script for similarusers  It isn't clear that analytics-privatedata-users is the right group for this....
[13:21:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Backlog seems to be gone per https://phabricator.wikimedia.org/T300914#8174044 but in any case, this shouldn't hurt" [deployment-charts] - https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[13:21:14] <wikibugs>	 SRE, SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (Ottomata) Approved!
[13:21:52] <wikibugs>	 (CR) Herron: WIP dispatch: add database role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:22:08] <wikibugs>	 (CR) Btullis: "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:22:17] <wikibugs>	 (CR) Btullis: [C: +1] P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:24:10] <wikibugs>	 (CR) Clément Goubert: "Just in case we want sirenbot in operations too, but I don't know if it's ready for prime-time yet ;)" [puppet] - https://gerrit.wikimedia.org/r/825346 (owner: Clément Goubert)
[13:25:09] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet
[13:25:10] <wikibugs>	 (PS1) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:25:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:26:21] <wikibugs>	 (CR) CI reject: [V: -1] admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:26:43] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36874/console" [puppet] - https://gerrit.wikimedia.org/r/825346 (owner: Clément Goubert)
[13:28:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P32740 and previous config saved to /var/cache/conftool/dbconfig/20220822-132808-root.json
[13:28:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:30:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32741 and previous config saved to /var/cache/conftool/dbconfig/20220822-133021-root.json
[13:30:35] <wikibugs>	 (CR) Hokwelum: "We noticed two files weren’t updated, the production-m3.sql.erb template file, and hieradata/common.yaml file. They both have some referen" [puppet] - https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[13:31:33] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:31:33] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[13:31:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:31:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32742 and previous config saved to /var/cache/conftool/dbconfig/20220822-133154-root.json
[13:32:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:33:27] <wikibugs>	 (PS2) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:33:57] <wikibugs>	 (CR) Filippo Giunchedi: O:prometheus: use map instead of reduce (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: Jbond)
[13:33:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:34:14] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36876/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:35:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:35:40] <wikibugs>	 SRE, Data Engineering Planning, Data-Engineering-Operations, Infrastructure-Foundations, Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (Ottomata)
[13:37:26] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:37:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890
[13:37:51] <stashbot>	 T314890: Service implementation for wdqs101[4,5,6] - https://phabricator.wikimedia.org/T314890
[13:37:58] <wikibugs>	 (PS3) Jbond: admin: test shell dependencies [puppet] - https://gerrit.wikimedia.org/r/825347
[13:38:00] <wikibugs>	 (PS1) Btullis: Add an extry for dse-k8s-ctrl to the service catalog [puppet] - https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172)
[13:38:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[1014-1016].eqiad.wmnet with reason: T314890
[13:38:24] <wikibugs>	 (PS2) Btullis: Add an entry for dse-k8s-ctrl to the service catalog [puppet] - https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172)
[13:38:47] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36877/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:39:04] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet
[13:39:11] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:39:41] <wikibugs>	 SRE, Analytics-Radar, Traffic-Icebox, Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (Ottomata) a:Ottomata→None
[13:41:37] <wikibugs>	 (PS4) Jbond: admin: Correct spelling additional_shells [puppet] - https://gerrit.wikimedia.org/r/825347
[13:42:19] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36878/console" [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:42:47] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] admin: Correct spelling additional_shells [puppet] - https://gerrit.wikimedia.org/r/825347 (owner: Jbond)
[13:44:51] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[13:45:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32743 and previous config saved to /var/cache/conftool/dbconfig/20220822-134526-root.json
[13:46:19] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:systemd::timesyncd: exclude /mnt from accessible paths [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:46:40] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:systemd::timesyncd: exclude /mnt from accessible paths (2 comments) [puppet] - https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: Jbond)
[13:46:55] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, thanks Amir for tackling this!" [alerts] - https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: Ladsgroup)
[13:46:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 20%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32744 and previous config saved to /var/cache/conftool/dbconfig/20220822-134658-root.json
[13:48:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: Ladsgroup)
[13:48:39] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Log cache read|write attempts [puppet] - https://gerrit.wikimedia.org/r/825350
[13:48:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs[1014-1016].eqiad.wmnet
[13:48:48] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs[1014-1016].eqiad.wmnet
[13:49:20] <wikibugs>	 (PS2) Muehlenhoff: raid_fact: Add new refactored raid fact [puppet] - https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: Jbond)
[13:50:50] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) I had a good discussion with @jbond on irc about how we model the host interfaces in Netbox, and I think bas...
[13:53:16] <wikibugs>	 (CR) Filippo Giunchedi: wmflib: introduce pythonloglevel type (1 comment) [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:53:21] <wikibugs>	 (PS2) Filippo Giunchedi: wmflib: introduce pythonloglevel type [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229)
[13:53:23] <wikibugs>	 (PS5) Filippo Giunchedi: WIP dispatch: add database role [puppet] - https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229)
[13:53:25] <wikibugs>	 (PS5) Filippo Giunchedi: WIP: add profile::dispatch [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[13:59:54] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:00:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32745 and previous config saved to /var/cache/conftool/dbconfig/20220822-140030-root.json
[14:02:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 30%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32746 and previous config saved to /var/cache/conftool/dbconfig/20220822-140203-root.json
[14:07:35] <wikibugs>	 (PS2) Eevans: eevans: replace 2048bit RSA key with new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/824776
[14:15:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32747 and previous config saved to /var/cache/conftool/dbconfig/20220822-141535-root.json
[14:16:45] <wikibugs>	 (CR) Eevans: [C: +2] eevans: replace 2048bit RSA key with new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/824776 (owner: Eevans)
[14:17:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 40%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32748 and previous config saved to /var/cache/conftool/dbconfig/20220822-141708-root.json
[14:22:32] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2142 to x2 master [puppet] - https://gerrit.wikimedia.org/r/825354 (https://phabricator.wikimedia.org/T315853)
[14:22:44] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[14:22:48] <stashbot>	 T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[14:22:49] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[14:23:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2142 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P32749 and previous config saved to /var/cache/conftool/dbconfig/20220822-142312-marostegui.json
[14:23:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:24:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:23] <marostegui>	 !log Starting x2 codfw failover from db2144 to db2142 - T315853
[14:24:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:26] <stashbot>	 T315853: reclone x2 codfw hosts - https://phabricator.wikimedia.org/T315853
[14:24:41] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2142 to x2 master [puppet] - https://gerrit.wikimedia.org/r/825354 (https://phabricator.wikimedia.org/T315853) (owner: Marostegui)
[14:25:05] <marostegui>	 urandom: Can I merge your puppet changes?
[14:26:58] <urandom>	 marostegui: oh, does that require something other than a merge in gerrit?
[14:27:08] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:11] <marostegui>	 urandom: Yeah, it needs merging at puppetmaster1001 :)
[14:27:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:24] <urandom>	 marostegui: haha, ok, yes
[14:27:35] <marostegui>	 urandom: I have merged it now (sudo -i puppet-merge)
[14:27:41] <wikibugs>	 (PS1) Klausman: Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356
[14:27:44] <urandom>	 marostegui: thank you
[14:27:50] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:28:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:29:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:05] <wikibugs>	 SRE, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (thcipriani)
[14:30:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32750 and previous config saved to /var/cache/conftool/dbconfig/20220822-143040-root.json
[14:31:38] <wikibugs>	 (PS2) Klausman: Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356
[14:32:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32751 and previous config saved to /var/cache/conftool/dbconfig/20220822-143212-root.json
[14:32:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T315853', diff saved to https://phabricator.wikimedia.org/P32752 and previous config saved to /var/cache/conftool/dbconfig/20220822-143243-root.json
[14:32:48] <stashbot>	 T315853: reclone x2 codfw hosts - https://phabricator.wikimedia.org/T315853
[14:33:04] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:34:10] <icinga-wm>	 RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:35:46] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:16] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:38:23] <moritzm>	 !log draining ganeti2019 for reimage T311686
[14:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:27] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[14:40:10] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:45:36] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:45:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:04] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:44] <wikibugs>	 (PS1) Jbond: P:systemd::timesyncd: will need to remove the following fill after merge [puppet] - https://gerrit.wikimedia.org/r/825358
[14:48:58] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:49:32] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:49:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore x2 weight', diff saved to https://phabricator.wikimedia.org/P32754 and previous config saved to /var/cache/conftool/dbconfig/20220822-144937-marostegui.json
[14:49:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32755 and previous config saved to /var/cache/conftool/dbconfig/20220822-144943-root.json
[14:49:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 60%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32756 and previous config saved to /var/cache/conftool/dbconfig/20220822-144951-root.json
[14:50:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2144', diff saved to https://phabricator.wikimedia.org/P32757 and previous config saved to /var/cache/conftool/dbconfig/20220822-145040-marostegui.json
[14:51:22] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:54:55] <XioNoX>	 !log drain ulsfo-codfw circuit for Lumen hot cut - T300716
[14:54:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:18] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:55:52] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:58:06] <wikibugs>	 (PS1) Hashar: puppet_compiler: relocate to /srv/jenkins [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698)
[14:58:21] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:01:34] <wikibugs>	 (CR) JMeybohm: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[15:02:19] <wikibugs>	 (CR) DCausse: "looks good, small nit about a comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[15:04:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32758 and previous config saved to /var/cache/conftool/dbconfig/20220822-150456-root.json
[15:05:47] <wikibugs>	 (CR) Jbond: [C: +2] P:systemd::timesyncd: will need to remove the following fill after merge [puppet] - https://gerrit.wikimedia.org/r/825358 (owner: Jbond)
[15:11:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops, SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (lmata)
[15:12:15] <icinga-wm>	 PROBLEM - puppet last run on search-loader1001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:12:48] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[15:14:22] <wikibugs>	 (CR) Ssingh: [C: +1] trafficserver: Log cache read|write attempts [puppet] - https://gerrit.wikimedia.org/r/825350 (owner: Vgutierrez)
[15:14:28] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:18:54] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36879/console" [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:19:58] <wikibugs>	 (CR) Jbond: "LGTM, FYI im on vacation from wedensday so would be good to do this early tomorrow to make sure we fix any fall out" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:20:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32759 and previous config saved to /var/cache/conftool/dbconfig/20220822-152000-root.json
[15:20:38] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) All production elastic hosts are on Bullseye now. Closing...
[15:20:52] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) Open→Resolved
[15:20:56] <wikibugs>	 SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (bking)
[15:20:58] <wikibugs>	 SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (bking)
[15:21:00] <wikibugs>	 SRE, Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (bking)
[15:21:35] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] wmflib: introduce pythonloglevel type [puppet] - https://gerrit.wikimedia.org/r/825253 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[15:21:42] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: Jbond)
[15:22:06] <wikibugs>	 SRE, Infrastructure-Foundations, observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (RobH)
[15:22:43] <wikibugs>	 SRE, Infrastructure-Foundations, observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (RobH) Thanks for the update!  This was raised as a concern when I handled of dumpsdata1007 for use in service, but noted it didn't yet have accurate raid...
[15:23:02] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (MoritzMuehlenhoff) Very nice!
[15:24:17] <wikibugs>	 SRE-OnFire, Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata, and 3 others: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (Gehel) This does not seem to be related to Search / WDQS, so I'll untag the Search Platform team. Ping us aga...
[15:26:14] <wikibugs>	 (CR) Vgutierrez: [C: +2] trafficserver: Log cache read|write attempts [puppet] - https://gerrit.wikimedia.org/r/825350 (owner: Vgutierrez)
[15:29:17] <wikibugs>	 SRE, Infrastructure-Foundations, observability: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (ArielGlenn) I feel a bit queasy about having a server in production without the ability to monitor the raid; what do folks think about this?
[15:30:04] <jouncebot>	 jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1530).
[15:30:15] <icinga-wm>	 RECOVERY - puppet last run on search-loader1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:30:45] <wikibugs>	 (PS1) Vgutierrez: Revert "trafficserver: Log cache read|write attempts" [puppet] - https://gerrit.wikimedia.org/r/825278
[15:30:49] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:34:53] <wikibugs>	 (CR) CI reject: [V: -1] Revert "trafficserver: Log cache read|write attempts" [puppet] - https://gerrit.wikimedia.org/r/825278 (owner: Vgutierrez)
[15:36:06] <wikibugs>	 (PS2) Vgutierrez: Revert "trafficserver: Log cache read|write attempts" [puppet] - https://gerrit.wikimedia.org/r/825278
[15:39:28] <wikibugs>	 (PS1) Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/825369
[15:40:07] <wikibugs>	 (PS1) Jbond: C:admin: add support for deprecated groups [puppet] - https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161)
[15:40:25] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "trafficserver: Log cache read|write attempts" [puppet] - https://gerrit.wikimedia.org/r/825278 (owner: Vgutierrez)
[15:41:21] <wikibugs>	 (CR) CI reject: [V: -1] Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/825369 (owner: Muehlenhoff)
[15:42:24] <wikibugs>	 (CR) CI reject: [V: -1] C:admin: add support for deprecated groups [puppet] - https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: Jbond)
[15:44:48] <wikibugs>	 (PS2) Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/825369
[15:45:38] <wikibugs>	 (PS2) Jbond: C:admin: add support for deprecated groups [puppet] - https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161)
[15:47:01] <wikibugs>	 (CR) CI reject: [V: -1] C:admin: add support for deprecated groups [puppet] - https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: Jbond)
[15:52:43] <XioNoX>	 !log un-drain ulsfo-codfw circuit for Lumen hot cut - T300716
[15:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:14] <wikibugs>	 (CR) Hnowlan: [C: +2] Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356 (owner: Klausman)
[15:56:40] <wikibugs>	 (PS3) Jbond: C:admin: add support for deprecated groups [puppet] - https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161)
[15:56:51] <wikibugs>	 (PS6) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[15:56:53] <wikibugs>	 (CR) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[15:56:56] <wikibugs>	 (CR) Hashar: puppet_compiler: relocate to /srv/jenkins (3 comments) [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:57:08] <wikibugs>	 (PS2) Hashar: puppet_compiler: relocate to /srv/jenkins [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698)
[15:58:27] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[15:58:35] <wikibugs>	 (Merged) jenkins-bot: Add routing for Lift Wing inference models [deployment-charts] - https://gerrit.wikimedia.org/r/825356 (owner: Klausman)
[16:02:11] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (wiki_willy) a:Cmjohnson→Jclark-ctr
[16:05:36] <wikibugs>	 (PS1) Jgreen: Add frdev-new-eqiad.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/825372
[16:10:28] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - https://gerrit.wikimedia.org/r/825375
[16:11:58] <wikibugs>	 (CR) Ssingh: [C: +1] trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - https://gerrit.wikimedia.org/r/825375 (owner: Vgutierrez)
[16:12:24] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36880/console" [puppet] - https://gerrit.wikimedia.org/r/825375 (owner: Vgutierrez)
[16:12:33] <wikibugs>	 (CR) Jgreen: [C: +2] Add frdev-new-eqiad.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/825372 (owner: Jgreen)
[16:15:50] <wikibugs>	 SRE, Scap, Python3-Porting, Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (hashar) This depends on whether we stick on `git-fat` (in which case we might need to do the porting, and even it is not immediately needed si...
[16:16:09] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Log cache read|write attempts on cp6008 and cp6016 [puppet] - https://gerrit.wikimedia.org/r/825375 (owner: Vgutierrez)
[16:17:05] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:10] <wikibugs>	 SRE, Scap, Python3-Porting, Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (demon) a:demon
[16:17:39] <wikibugs>	 SRE, Scap, Python3-Porting, Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (MoritzMuehlenhoff) >>! In T279509#8174957, @hashar wrote: > This depends on whether we stick on `git-fat` (in which case we might need to do t...
[16:18:59] <wikibugs>	 (CR) Hashar: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/1403/console failed cause there are no facts found for th" [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[16:21:43] <wikibugs>	 SRE, Scap, Python3-Porting, Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (hashar) > Bullseye doesn't ship Python 2.7 in a supported version, it's only included to _build_ a few packages (e.g. qtwebkit).  **Oops** my...
[16:25:43] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:13] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:47:44] <wikibugs>	 (CR) Jbond: puppet_compiler: relocate to /srv/jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: Hashar)
[16:51:22] <wikibugs>	 (CR) Brennen Bearnes: [V: +2 C: +2] scap: add permission mangling, reorder checks [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[16:52:08] <wikibugs>	 (PS5) Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196)
[16:52:52] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add a dummy auth_key for the dse_k8s cluster cfssl-issuer [labs/private] - https://gerrit.wikimedia.org/r/824725 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[16:57:24] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (BTullis)
[17:00:05] <jouncebot>	 ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T1700).
[17:03:20] <wikibugs>	 (CR) Dzahn: "@Dduvall Would that work for you? Could you follow Majavah's advice and add it on the local puppetmaster?" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:09:26] <wikibugs>	 (CR) Ottomata: [C: +1] "@aqu, I see that you responded to my comments in the latest patches. Thank you!" [puppet] - https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[17:18:45] <wikibugs>	 (PS3) Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434)
[17:18:51] <wikibugs>	 (PS3) Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434)
[17:22:36] <wikibugs>	 (PS1) Andrew Bogott: Keystone: expand the password safelist to specify restricted domains [puppet] - https://gerrit.wikimedia.org/r/825380
[17:27:39] <wikibugs>	 (CR) Majavah: [C: +1] "The idea looks good to me, and the implementation too but I don't really have a way to test it." [puppet] - https://gerrit.wikimedia.org/r/825380 (owner: Andrew Bogott)
[17:28:09] <wikibugs>	 (CR) Dduvall: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:28:30] <wikibugs>	 (Abandoned) Dduvall: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:29:12] <wikibugs>	 (CR) Majavah: "Um, why was this abandoned?" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:29:21] <wikibugs>	 (Restored) Dzahn: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:29:43] <wikibugs>	 (CR) Dzahn: "thank you, but in this case we should actually merge this, since it was the revert of adding it in labs/private" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:30:30] <wikibugs>	 (PS2) Andrew Bogott: Keystone: expand the password safelist to specify restricted domains [puppet] - https://gerrit.wikimedia.org/r/825380
[17:38:51] <wikibugs>	 (PS1) Jdlrobson: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460)
[17:47:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Keystone: expand the password safelist to specify restricted domains [puppet] - https://gerrit.wikimedia.org/r/825380 (owner: Andrew Bogott)
[17:51:59] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[17:55:40] <wikibugs>	 (CR) Dzahn: "@Hokwelum regarding the prod-m3.sql.erb file. Yea, that needs coordination with DBA (in addition to editing the file in repo). There are e" [puppet] - https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[17:59:33] <wikibugs>	 (CR) CI reject: [V: -1] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: Jdlrobson)
[18:00:33] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Disable origin coalescing in cp601[56] [puppet] - https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911)
[18:02:18] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36881/console" [puppet] - https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[18:06:11] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:07] <wikibugs>	 (CR) BBlack: [C: +1] trafficserver: Disable origin coalescing in cp601[56] [puppet] - https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[18:08:52] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Disable origin coalescing in cp601[56] [puppet] - https://gerrit.wikimedia.org/r/825390 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[18:12:02] <vgutierrez>	 !log disable origin coalescing in ats@cp601[56] - T315911
[18:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:06] <stashbot>	 T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911
[18:14:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Fixed in the following patches:  1. [[ https://gerrit.wikimedia.org/r/822196 | #822196 - netmon: Create...
[18:14:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, SRE Observability (FY2022/2023-Q1): Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Open→Resolved
[18:16:23] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Papaul) @Joe can you please provide me with the partman recipe to use for those servers.The description says only Raid1 .   thanks
[18:16:34] <wikibugs>	 (PS3) Htriedman: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:17:10] <wikibugs>	 (CR) CI reject: [V: -1] Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:18:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-wf2001
[18:18:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-wf2001
[18:19:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-wf2002
[18:19:06] <wikibugs>	 (Abandoned) Andrew Bogott: keystone: add restrict_password_auth flag [puppet] - https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195) (owner: Andrew Bogott)
[18:19:26] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack codfw1dev to version Xena [puppet] - https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[18:19:35] <wikibugs>	 (CR) Htriedman: "incorporated proposed edits from BBlack" [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:19:36] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-wf2002
[18:21:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:21:43] <wikibugs>	 (CR) Bernard Wang: "recheck" [skins/Vector] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: Jdlrobson)
[18:23:43] <wikibugs>	 SRE, Observability-Logging, Observability-Metrics, Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (Krinkle)
[18:25:28] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Papaul)
[18:26:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:26:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:39] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc-wf2001.mgmt.codfw.wmnet with reboot policy FORCED
[18:30:23] <wikibugs>	 (PS1) Andrew Bogott: Openstack Trove: remove some file resources no longer needed in X [puppet] - https://gerrit.wikimedia.org/r/825394 (https://phabricator.wikimedia.org/T296561)
[18:31:26] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Trove: remove some file resources no longer needed in X [puppet] - https://gerrit.wikimedia.org/r/825394 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[18:34:47] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:35:28] <wikibugs>	 SRE, Data-Services, Discovery-Search, Wikidata, and 3 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Gehel) a:Gehel→None
[18:47:56] <wikibugs>	 (PS4) Andrea Denisse: netmon: Create LibreNMS logs file. [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393)
[18:50:55] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:53:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf2001.mgmt.codfw.wmnet with reboot policy FORCED
[18:53:54] <wikibugs>	 (PS2) Dzahn: phabricator: add phab1004 to list of phab hosts for firewall [puppet] - https://gerrit.wikimedia.org/r/824802 (https://phabricator.wikimedia.org/T280597)
[18:54:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc-wf2002.mgmt.codfw.wmnet with reboot policy FORCED
[18:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:59:24] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@5ac442f]: Use instance specific HDFS cache on platform_eng
[18:59:34] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@5ac442f]: Use instance specific HDFS cache on platform_eng (duration: 00m 10s)
[19:00:05] <wikibugs>	 (03PS1) 10Jgreen: Remove frlog1001 from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/825395 (https://phabricator.wikimedia.org/T312581)
[19:00:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "allowing connections from/to phab1004 and other phab hosts" [puppet] - 10https://gerrit.wikimedia.org/r/824802 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[19:02:09] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Remove frlog1001 from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/825395 (https://phabricator.wikimedia.org/T312581) (owner: 10Jgreen)
[19:02:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:15] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics_test@9edd1ab]: Use instance specific HDFS cache on analytics_test
[19:04:21] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics_test@9edd1ab]: Use instance specific HDFS cache on analytics_test (duration: 00m 05s)
[19:09:40] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jgreen)
[19:11:09] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics_test@5ac442f]: Use instance specific HDFS cache on analytics_test
[19:11:27] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics_test@5ac442f]: Use instance specific HDFS cache on analytics_test (duration: 00m 17s)
[19:12:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf2002.mgmt.codfw.wmnet with reboot policy FORCED
[19:16:55] <wikibugs>	 10SRE, 10Wikimedia-GitHub: stop syncing and delete labs/private repo from github - https://phabricator.wikimedia.org/T315925 (10Dzahn)
[19:20:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-wf2001']
[19:27:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-wf2001']
[19:27:21] <wikibugs>	 (03CR) 10Bking: [C: 03+2] bullseye: apt component update [puppet] - 10https://gerrit.wikimedia.org/r/824791 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[19:28:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-wf2002']
[19:30:12] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:18] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-wf2002']
[19:46:02] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:49] <wikibugs>	 (03PS1) 10Papaul: Add mc-wf200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/825398 (https://phabricator.wikimedia.org/T313966)
[19:48:25] <wikibugs>	 (03PS1) 10Andrew Bogott: Move cloudbackup100[12]-dev to Xena [puppet] - 10https://gerrit.wikimedia.org/r/825399 (https://phabricator.wikimedia.org/T296561)
[19:49:19] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add mc-wf200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/825398 (https://phabricator.wikimedia.org/T313966) (owner: 10Papaul)
[19:49:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move cloudbackup100[12]-dev to Xena [puppet] - 10https://gerrit.wikimedia.org/r/825399 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott)
[19:50:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2001.codfw.wmnet with OS bullseye
[19:50:21] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc-wf2001.codfw.wmnet with OS bullseye
[19:52:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui Chris is out on vacation I will take a look later to see.
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T2000).
[20:00:04] <jouncebot>	 bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:46] * urbanecm waves
[20:01:02] <urbanecm>	 I can deploy today
[20:01:21] <urbanecm>	 bwang: hi, are you around?
[20:01:34] <bwang>	 yes!
[20:01:46] <urbanecm>	 Great!
[20:02:17] <urbanecm>	 bwang: your patch seems to fail CI. Why is that happening, please?
[20:03:09] <bwang>	 ah sorry, ill check now
[20:03:38] <urbanecm>	 thanks
[20:03:54] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@5ac442f]: Use instance specific HDFS cache on analytics
[20:03:56] <wikibugs>	 (03CR) 10Bernard Wang: "recheck" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson)
[20:04:07] <bwang>	 hm odd, the errors dont seem related to the patch
[20:04:34] <urbanecm>	 ah, should've looked at the errors first. seems to be T315892, which is now fixed
[20:04:34] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@5ac442f]: Use instance specific HDFS cache on analytics (duration: 00m 40s)
[20:04:39] <stashbot>	 T315892: PHPUnit\Framework\Exception: This test uses TestCase::prophesize(), but phpspec/prophecy is not installed. - https://phabricator.wikimedia.org/T315892
[20:04:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson)
[20:04:54] <urbanecm>	 +2'ed and let's hope :)
[20:07:22] <wikibugs>	 (03CR) 10Cwhite: "Some suggestions inline to make the query a bit more efficient and exclude possibly changing hostnames." [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup)
[20:08:21] <bwang>	 sounds good!
[20:09:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage
[20:13:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage
[20:13:50] <urbanecm>	 fails again, but that's because the fix's not in wmf.25. meh.
[20:14:26] <wikibugs>	 (03PS1) 10Urbanecm: composer.json: Pin phpunit to 8.5.28 [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892)
[20:14:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "CI issues during backporting" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892) (owner: 10Urbanecm)
[20:14:59] <wikibugs>	 (03PS2) 10Urbanecm: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson)
[20:15:05] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson)
[20:15:17] <urbanecm>	 bwang: trying again, this time with a proper depends-on.
[20:15:31] <bwang>	 👍
[20:18:19] <wikibugs>	 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10nskaggs) @Kelson Can you clarify how much additional space would be needed now? I saw the description...
[20:20:39] <wikibugs>	 (03CR) 10Andrea Denisse: "This patch sets up the correct directory to gather LibreNMS logs with logrotate." [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[20:21:24] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Configure Logrotate for LibreNMS logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[20:25:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36885/" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[20:27:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "It only affects vrts1001, i'll merge it." [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[20:28:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2001.codfw.wmnet with OS bullseye
[20:28:22] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc-wf2001.codfw.wmnet with OS bullseye completed: -...
[20:32:22] <wikibugs>	 (03Merged) 10jenkins-bot: composer.json: Pin phpunit to 8.5.28 [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825281 (https://phabricator.wikimedia.org/T315892) (owner: 10Urbanecm)
[20:33:14] <wikibugs>	 (03Merged) 10jenkins-bot: Layout: Restore disabling of max width on certain pages [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825280 (https://phabricator.wikimedia.org/T315460) (owner: 10Jdlrobson)
[20:33:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2002.codfw.wmnet with OS bullseye
[20:34:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc-wf2002.codfw.wmnet with OS bullseye
[20:35:58] <wikibugs>	 (03PS1) 10Ori: Incremental roll-out of query-sorting (5%) [puppet] - 10https://gerrit.wikimedia.org/r/825404 (https://phabricator.wikimedia.org/T314868)
[20:36:33] <wikibugs>	 (03PS2) 10Ori: Incremental roll-out of query-sorting (5%) [puppet] - 10https://gerrit.wikimedia.org/r/825404 (https://phabricator.wikimedia.org/T314868)
[20:37:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:38:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:38:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:39:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:39:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10Jclark-ctr) replaced qsfp in e1 port 54
[20:44:25] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10cmooney) Thanks @Jclark-ctr that seems to have done it:  ` cmooney@lsw1-e1-eqiad> show interfaces et-0/0/54     Aug 22 20:37:45 Physical interface: et-0/0...
[20:44:52] <urbanecm>	 bwang: sorry, missed the patches already merged.
[20:45:08] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10cmooney) 05Open→03Resolved
[20:45:11] <bwang>	 ok! where should i test it?
[20:45:23] <urbanecm>	 pulling to test srv now
[20:45:57] <urbanecm>	 bwang: pulled to mwdebug1001 now, please test it there :)
[20:46:39] <bwang>	 looks good!
[20:47:19] <urbanecm>	 thanks, syncing!
[20:51:36] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.25/skins/Vector/: e0ff7634ac529acec6d298992b45b23203b682c1: Layout: Restore disabling of max width on certain pages (T315460) (duration: 03m 37s)
[20:51:41] <stashbot>	 T315460: [Regression] Pages which are supposed to have full-width no longer have full-width layout - https://phabricator.wikimedia.org/T315460
[20:51:47] <urbanecm>	 bwang: and should be live. thanks for your patience :).
[20:52:05] <bwang>	 thank you!
[20:52:53] <wikibugs>	 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (10fkaelin) Reviving this discussion, though I renamed the phab to "Running docker containers in a non-production environment", as the issue boils down to t...
[20:53:18] <wikibugs>	 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (10fkaelin)
[20:53:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[20:56:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[20:58:52] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
[20:58:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye
[20:59:09] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1195.eqiad.wmnet with OS bullseye
[20:59:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye executed with er...
[20:59:31] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
[20:59:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye
[20:59:54] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye
[20:59:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with er...
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220822T2100).
[21:00:49] <sbassett>	 ^ At least one going out soon for T310763...
[21:01:37] <AnaisGueyte>	 Yes! I'm here for T310763!
[21:01:55] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
[21:02:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye
[21:02:04] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1185.eqiad.wmnet with OS bullseye
[21:02:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye executed with er...
[21:03:51] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:22] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host db1185.eqiad.wmnet
[21:06:40] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host db1185.eqiad.wmnet
[21:11:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2002.codfw.wmnet with OS bullseye
[21:11:58] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc-wf2002.codfw.wmnet with OS bullseye completed: - mc-wf2002 (**PASS**)...
[21:13:39] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul)
[21:14:05] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Papaul) 05Open→03Resolved complete
[21:17:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[21:17:19] <stashbot>	 T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604
[21:17:38] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[21:19:17] <wikibugs>	 10SRE, 10Projects-Cleanup, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10CCicalese_WMF) 05Open→03Resolved I believe this has been complete for some time.
[21:26:35] <sbassett>	 !log Deployed security fix for T310763
[21:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:50] <AnaisGueyte>	 Confirming it's done by testing
[21:30:26] <wikibugs>	 10SRE, 10Projects-Cleanup, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn) The ticket was open because not all checkboxes were checked yet. One is the removal of the tag in Phabricator.
[21:30:30] <sbassett>	 Thanks, AnaisGuetyte!
[21:30:44] <sbassett>	 Or AnaisGueyte, rather.
[21:30:55] <AnaisGueyte>	 All good :)
[21:31:36] <wikibugs>	 (03PS1) 10Bking: apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604)
[21:31:39] <wikibugs>	 10SRE, 10Projects-Cleanup, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn) I clicked "archive project" on https://phabricator.wikimedia.org/tag/fixcopyright.wikimedia.org/
[21:31:56] <wikibugs>	 10SRE, 10Projects-Cleanup, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Dzahn)
[21:32:44] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[21:35:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "File[/etc/cron.daily/spamassassin]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[21:41:39] <wikibugs>	 (03CR) 10Bking: [C: 03+2] apt: changes to pull in latest elastic version [puppet] - 10https://gerrit.wikimedia.org/r/825413 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking)
[21:45:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[21:46:01] <stashbot>	 T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604
[21:46:08] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[21:53:16] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[21:55:47] <stashbot>	 T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604
[21:56:16] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604
[22:02:28] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (10Dzahn) also see T273164
[22:09:59] <wikibugs>	 (03PS1) 10Dzahn: spamassassin: fix spamassassin_updates script name in timer [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673)
[22:12:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/825416" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[22:13:26] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Aug 22 21:33:55 otrs1001 systemd[21872]: spamassassin_updates.service: Failed at step EXEC spawning /usr/local/sbin/spamassassin_updates: " [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:15:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "ran manual rm /usr/local/sbin/spamassassin_timer.sh and systemctl start spamassassin_updates.service" [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:15:56] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "<+icinga-wm> RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/825416 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:19:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] netmon: Configure Logrotate for LibreNMS logs [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[22:20:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff)
[22:20:58] <wikibugs>	 (03PS2) 10Dzahn: vrts: Always install the latest version of libdatetime-timezone-perl [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff)
[22:24:32] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:28] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:26] <wikibugs>	 (03CR) 10Dzahn: "yep, thanks. this was noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/825333 (owner: 10Muehlenhoff)
[22:30:58] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:18] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:54] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:49:56] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:56:44] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:53] <wikibugs>	 (03CR) 10Dzahn: "thanks for fixing it! I don't know how but it works now. probably because now the user phd was already created" [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[22:59:00] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling)
[22:59:33] <wikibugs>	 (03CR) 10Dzahn: "Thank you, I guess you could not reproduce because now the phd user was already created by user{}. I'll see what happens on phab1004." [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond)
[23:00:38] <wikibugs>	 (03PS1) 10Zabe: Run the initsitestats period job on a daily basis [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121)
[23:02:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks! I can confirm I ran this manually before and it did not take long and also it's a real issue that creates IRC pings" [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe)
[23:04:25] <TimStarling>	 !log Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org
[23:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:34] <wikibugs>	 (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/36886/" [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe)
[23:08:29] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) The rollout was reverted back to stage 0 on August 15 due to T315271. I just reverted the revert, so it will be running on mediawiki.org once the...
[23:10:45] <logmsgbot>	 !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(appservers|api)-ro
[23:11:22] <wikibugs>	 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle)
[23:18:46] <wikibugs>	 (03PS2) 10RLazarus: Run the initsitestats period job on a daily basis [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe)
[23:19:22] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Thanks Zabe! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/825424 (https://phabricator.wikimedia.org/T315121) (owner: 10Zabe)
[23:37:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Jclark-ctr when you are back on site, can you please check the cable on db1185? It looks like the cable is not connected.  ` papaul@asw2-a-eqiad>...
[23:39:51] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bullseye
[23:39:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1187.eqiad.wmnet with OS bullseye
[23:50:08] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:52:32] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage
[23:55:11] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage