[00:08:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:59] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:20:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:22:05] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:31:00] <wikibugs>	 (PS1) Tim Starling: Microsecond timestamp resolution in UDP logs [mediawiki-config] - https://gerrit.wikimedia.org/r/820904
[00:31:39] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:37] <wikibugs>	 (CR) Tim Starling: "I tested it in production by making this change on mwdebug2001 and then sending a request to it with X-Wikimedia-Debug." [mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling)
[00:39:19] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:05] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:51] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:43] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:55] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:18] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I tried a warmup request followed by another request for the same page view, the second having MW logging enabled with [[https://gerrit.wikimedia....
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:47] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:59:47] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[02:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220807T0700)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T0200)
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:53] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:09:16] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I tested parse times with `ab -n10 -H'X-Forwarded-Proto: https' -X mw1441.eqiad.wmnet:80 'http://test2.wikipedia.org/w/api.php?action=parse&format...
[02:10:21] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[02:15:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:15] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:27] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:56:35] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:58:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:17:35] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:05] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:37] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:58:49] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:02:29] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:21:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:52:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:39] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:04:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:50] <wikibugs>	 (PS2) KartikMistry: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829)
[05:15:24] <wikibugs>	 (CR) Tim Starling: "So how's it looking? Was the test successful?" [mediawiki-config] - https://gerrit.wikimedia.org/r/146849 (owner: Bsitu)
[05:23:59] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:08] <wikibugs>	 SRE, Performance-Team, serviceops: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (tstarling)
[05:34:13] <wikibugs>	 (PS1) Tim Starling: Remove abandoned Echo job queue test [mediawiki-config] - https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750)
[05:35:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:13] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:44:35] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:50:26] <wikibugs>	 (PS1) Tim Starling: Remove testwiki example.org link [mediawiki-config] - https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750)
[05:50:28] <wikibugs>	 (PS1) Tim Starling: Remove wgVectorResponsive [mediawiki-config] - https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750)
[05:50:30] <wikibugs>	 (PS1) Tim Starling: Remove override for wgRevisionCacheExpiry [mediawiki-config] - https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750)
[05:55:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:00:47] <wikibugs>	 (PS1) Tim Starling: Remove testwiki wgTorTagChanges override [mediawiki-config] - https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750)
[06:00:49] <wikibugs>	 (PS1) Tim Starling: Remove testwiki live preview demo [mediawiki-config] - https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750)
[06:07:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:59] <wikibugs>	 (PS1) Tim Starling: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750)
[06:31:01] <wikibugs>	 (PS1) Tim Starling: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750)
[06:31:03] <wikibugs>	 (PS1) Tim Starling: Remove wmgUseWikimediaShopLink [mediawiki-config] - https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365)
[06:36:32] <wikibugs>	 SRE, serviceops: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe)
[06:38:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:15] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:42:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) a:Joe→None
[06:42:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) @RobH all info should be filled in now.
[06:43:40] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe) a:Joe→None
[06:44:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe) a:Papaul @RobH the task should be complete with all the info, reassigning to Papaul
[06:44:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) a:Jclark-ctr
[06:45:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe)
[06:45:30] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe)
[06:57:31] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:23] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T0700).
[07:00:04] <jouncebot>	 kart_ and koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] * kart_ is here.
[07:00:24] <kart_>	 and will self-deploy..
[07:00:42] <wikibugs>	 (CR) KartikMistry: [C: +2] Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry)
[07:00:54] <koi>	 hi kart_, would you like to also deploy my patch :)
[07:01:04] <kart_>	 koi: sure! Let me check.
[07:02:58] <wikibugs>	 (Merged) jenkins-bot: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry)
[07:04:37] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] Netbox: add CSP headers [puppet] - https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356) (owner: Ayounsi)
[07:05:37] <elukey>	 !log restart rsyslog on ml-serve-ctrl2001
[07:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:44] <XioNoX>	 !log add CSP headers to Netbox - T296356
[07:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:09] <kart_>	 koi: Deploying my patch. I'll ping when your patch is ready for testing..
[07:09:29] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:09:58] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261|Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
[07:10:00] <stashbot>	 T308829: Enable Section Translation on 10 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T308829
[07:10:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:10:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:10:59] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:11:14] <wikibugs>	 (PS2) KartikMistry: trwikivoyage: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: Stang)
[07:11:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
[07:11:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:11:50] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:11:54] <elukey>	 !log restart rsyslog on ml-serve2007
[07:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:46] <wikibugs>	 (CR) KartikMistry: [C: +2] "UTC Morning Config Deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: Stang)
[07:14:48] <wikibugs>	 (Merged) jenkins-bot: trwikivoyage: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: Stang)
[07:15:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:16:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:17:05] <kart_>	 koi: Please test patch on mwdebug1001
[07:17:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:17:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:18:45] <koi>	 kart_: tested and LGTM
[07:18:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:19:09] <kart_>	 koi: cool. Deploying..
[07:19:47] <wikibugs>	 (PS1) Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456)
[07:22:26] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: Kevin Bazira)
[07:22:46] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815|trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
[07:22:48] <wikibugs>	 (CR) Ayounsi: [C: +2] Netbox: add hourly postgres backups [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[07:22:49] <stashbot>	 T314678: Add rollbacker user group to trwikivoyage - https://phabricator.wikimedia.org/T314678
[07:23:23] <kart_>	 koi: Done.
[07:23:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:23:57] <koi>	 thanks a lot!
[07:24:17] <kart_>	 I need to go for quick lunch + meeting now. If any other deployers available, config deployment window has still approx 35 minutes left..
[07:25:29] <kart_>	 Oh, I've patch, but I forgot to even submit it. Tomorrow maybe!
[07:50:15] <wikibugs>	 (PS1) Ayounsi: Netbox backup: only run on the primary node [puppet] - https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677)
[07:50:26] <godog>	 !log grow sda/sdb 3 by 100G on thanos-be1004 - T314275
[07:50:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:29] <stashbot>	 T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[07:53:39] <icinga-wm>	 PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:53:46] <godog>	 !log grow sda/sdb 3 by 100G on thanos-be2001 - T314275
[07:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
[07:57:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[07:57:06] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:57:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[07:57:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
[07:57:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:08] <wikibugs>	 (PS1) Filippo Giunchedi: install_server: set minimum 200G for swift sd[ab]3 [puppet] - https://gerrit.wikimedia.org/r/821174 (https://phabricator.wikimedia.org/T314275)
[08:00:57] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:01:11] <wikibugs>	 (PS1) Ayounsi: Postgres dumps: add hour and minute to filename [puppet] - https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677)
[08:03:03] <wikibugs>	 (PS2) Ayounsi: Postgres dumps: add hour and minute to filename [puppet] - https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677)
[08:06:55] <wikibugs>	 (CR) Ayounsi: [C: +2] "Self merging as well as it seems low risk." [puppet] - https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[08:09:09] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you! My bad re: syntax" [puppet] - https://gerrit.wikimedia.org/r/820800 (https://phabricator.wikimedia.org/T313603) (owner: CDanis)
[08:09:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:52] <wikibugs>	 (PS1) Ayounsi: Netbox: remove CSV dump directory and time [puppet] - https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615)
[08:25:54] <wikibugs>	 (PS1) Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615)
[08:26:49] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: some yaml autoformatting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816005 (owner: David Caro)
[08:28:15] <wikibugs>	 (PS4) David Caro: ceph:osd: add support for multi-network setup [puppet] - https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209)
[08:28:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:40] <wikibugs>	 (PS1) Ayounsi: Remove CSV dump scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615)
[08:32:21] <wikibugs>	 (PS2) Ayounsi: Netbox: remove CSV dump directory and timer [puppet] - https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615)
[08:34:42] <wikibugs>	 (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36636/" [puppet] - https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: Ayounsi)
[08:35:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:36:02] <wikibugs>	 (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36637/" [puppet] - https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: Ayounsi)
[08:38:48] <wikibugs>	 (PS1) Jcrespo: Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - https://gerrit.wikimedia.org/r/820868
[08:39:03] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:31] <wikibugs>	 (CR) Ladsgroup: [C: +1] Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - https://gerrit.wikimedia.org/r/820868 (owner: Jcrespo)
[08:41:35] <jbond>	 !log deploy libtirpc update
[08:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:51] <wikibugs>	 (CR) Ladsgroup: "I think we need to keep db1132 (10.6) as we are doing a lot of experiments on it, some that's making it go down sometimes: T311106. The re" [puppet] - https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: Jcrespo)
[08:46:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:43] <wikibugs>	 (CR) Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: Jcrespo)
[08:48:01] <wikibugs>	 (PS1) David Caro: wmcs: autoformat our yaml files [puppet] - https://gerrit.wikimedia.org/r/821181
[08:48:39] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:49:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: some yaml autoformatting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro)
[08:49:37] <wikibugs>	 (03CR) 10David Caro: "From https://gerrit.wikimedia.org/r/c/operations/puppet/+/816005/2#message-3e6bf118fb0f5d1c209fb907494f3c8b3cff88b8" [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro)
[08:53:13] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[08:54:59] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:57:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:01:03] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:08] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Revert a few leftover disabled notif., belived to be wrong (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: 10Jcrespo)
[09:02:11] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:54] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove wmgUseWikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: 10Tim Starling)
[09:04:51] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:05:36] <wikibugs>	 (03CR) 10Elukey: "Hey Kevin! The change looks good, but for staging we'd probably need to keep the number of pods low, so probably only enwiki is enough. If" [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[09:09:10] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Potentially related to T219592 as well. Either we can remove a bunch of code in Echo, or it's unfinished/abandoned solution to T219592 for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:09:14] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove testwiki example.org link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:14:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:17] <wikibugs>	 10SRE, 10Data Engineering Planning: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10BTullis) Yes I am still interested. Adding it to our planning board for discussion.
[09:18:47] <wikibugs>	 (03PS1) 10Btullis: Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 (https://phabricator.wikimedia.org/T313129)
[09:19:37] <wikibugs>	 (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche)
[09:19:43] <wikibugs>	 (03CR) 10Jaime Nuche: "Tested in beta" [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche)
[09:20:40] <wikibugs>	 (03PS2) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677)
[09:20:51] <wikibugs>	 (03PS6) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[09:23:13] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36638/console" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:23:41] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:23:44] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:23:50] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:23:58] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:24:07] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:24:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:24:39] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:25:10] <wikibugs>	 (03PS7) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[09:25:20] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36639/" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[09:25:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:26:10] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36640/console" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:26:21] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Also matches the FeatureFeeds' extension default. I think we mostly try to move declarations to IS.php away from CS.php statements (part o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling)
[09:30:46] <wikibugs>	 (03PS8) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[09:32:04] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: move db::dump_interval for profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677)
[09:33:00] <wikibugs>	 (03PS2) 10Ayounsi: Netbox: move db::dump_interval to profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677)
[09:33:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:54] <wikibugs>	 (03CR) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:38:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond)
[09:42:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - 10https://gerrit.wikimedia.org/r/820868 (owner: 10Jcrespo)
[09:49:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi)
[09:52:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM assuming the authorisation comes" [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1)
[09:54:09] <wikibugs>	 (03PS3) 10RhinosF1: admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955)
[09:54:34] <RhinosF1>	 jbond: ty, i believe mortiz was doing the checks
[09:56:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:57:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:05] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Netbox: add housekeeping systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi)
[09:59:29] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:24] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:14] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:11:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10larissagaulia) Thank you all
[10:15:30] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:17:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
[10:18:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
[10:18:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ayounsi) p:05Medium→03High ` Aug  8 06:10:32  mr1-eqiad /kernel: KERN_ARP_ADDR_CHANGE: arp info overwritten for 10.65.2.255 from d0:8e:79:f4:1...
[10:23:07] <wikibugs>	 (03CR) 10Jbond: "cr looks good to me but its not clear from the commit why the splay is needed" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:23:57] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106)
[10:24:06] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885)
[10:25:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[10:25:42] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
[10:26:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, of course will need the previous change to apply before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[10:26:18] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo)
[10:26:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro)
[10:26:45] <wikibugs>	 (03PS3) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677)
[10:26:54] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106)
[10:26:54] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:27:26] <wikibugs>	 (03CR) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:28:54] <wikibugs>	 (03CR) 10Jcrespo: "Amir: this showed as a conflict- do you know where this comes from? Can it be deleted?" [puppet] - 10https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot)
[10:29:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:29:36] <wikibugs>	 (03CR) 10Jcrespo: "Same here." [puppet] - 10https://gerrit.wikimedia.org/r/768652 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot)
[10:30:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox: move db::dump_interval to profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:30:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[10:30:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:30:43] <wikibugs>	 (03PS9) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910
[10:31:43] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo)
[10:31:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi)
[10:34:49] <wikibugs>	 (03CR) 10Jbond: "LGTM, although one wonders if the damage is already done :P" [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro)
[10:34:50] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:35:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
[10:37:21] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885)
[10:37:25] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885) (owner: 10Ladsgroup)
[10:39:58] <Amir1>	 !log Removing db2079 from zarcillo (T313885)
[10:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:01] <stashbot>	 T313885: decommission db2079 - https://phabricator.wikimedia.org/T313885
[10:43:21] <Amir1>	 !log Removing db2079 from orchestrator (T313885)
[10:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:01] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Ladsgroup) This host is ready for DC-Ops to decommission
[10:45:09] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Ladsgroup) a:03Papaul
[10:46:16] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:37] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo)
[10:49:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see comment inline." [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:49:31] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: remove CSV dump directory and timer [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615)
[10:51:54] <wikibugs>	 (03PS2) 10Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615)
[10:53:35] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox: remove CSV dump directory and timer [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[10:57:44] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] gitignore: add note to use global ignore file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro)
[10:57:51] <wikibugs>	 (03PS4) 10David Caro: gitignore: add note to use global ignore file [puppet] - 10https://gerrit.wikimedia.org/r/812343
[10:59:25] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[10:59:37] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615)
[11:01:56] <jynus>	 this is a weird warning: "WARN: 0 puppet certs need to be renewed:"
[11:02:55] <wikibugs>	 (03CR) 10Abijeet Patro: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro)
[11:03:01] <wikibugs>	 (03PS2) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587)
[11:08:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:55] <wikibugs>	 (03CR) 10Ayounsi: "This looks like a CI issue?" [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[11:10:44] <wikibugs>	 (03CR) 10Ayounsi: "I think this is safe to merge?" [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond)
[11:12:13] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[11:13:46] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 (owner: 10Ayounsi)
[11:14:27] <wikibugs>	 (03CR) 10Nikerabbit: Enable message bundle on MetaWiki for WikiLearn (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro)
[11:14:39] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[11:14:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "I think the best solution is to implement 3 separate "modes":" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/820664 (owner: 10Jcrespo)
[11:16:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:59] <wikibugs>	 (03Merged) 10jenkins-bot: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[11:21:09] <jelto>	 !log kubectl uncordon kubernetes2022.codfw.wmnet
[11:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:44] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
[11:33:35] <wikibugs>	 (03PS1) 10Urbanecm: Move WEIGHT_* constants to IMentorWeights [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820870 (https://phabricator.wikimedia.org/T314362)
[11:34:24] <wikibugs>	 (03PS1) 10Urbanecm: MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362)
[11:34:47] <wikibugs>	 (03Abandoned) 10Urbanecm: Move WEIGHT_* constants to IMentorWeights [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820870 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm)
[11:36:34] <urbanecm>	 jouncebot: nowandnext
[11:36:34] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 23 minute(s)
[11:36:34] <jouncebot>	 In 1 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1300)
[11:36:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm)
[11:43:22] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
[11:48:40] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:04] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:29] <wikibugs>	 (03CR) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:58:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:35] <wikibugs>	 (03Merged) 10jenkins-bot: MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm)
[12:03:44] <icinga-wm>	 PROBLEM - Host an-worker1102 is DOWN: PING CRITICAL - Packet loss = 100%
[12:04:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:06:14] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: 3eaf155678b7313c55dcca0cd39ab29f73eead37: MentorTools: Do not use MentorWeightManager (T314362) (duration: 03m 31s)
[12:06:17] <stashbot>	 T314362: Ensure MentorWeightManager is not used with structured mentor list - https://phabricator.wikimedia.org/T314362
[12:06:21] * urbanecm done
[12:09:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:09:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:11:02] <icinga-wm>	 PROBLEM - Check systemd state on mw2393 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:25:36] <icinga-wm>	 RECOVERY - Host an-worker1102 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[12:26:22] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
[12:49:42] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215
[12:51:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215 (owner: 10Urbanecm)
[12:52:12] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215 (owner: 10Urbanecm)
[12:56:22] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 77fd5abdd7d9462869259e1511bbcf2d7ce62246: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
[12:58:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:59:02] <wikibugs>	 (03PS1) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[12:59:56] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:16] * urbanecm waves
[13:00:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36641/console" [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[13:01:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:01:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:02:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (10cmooney) @ayounsi @Papaul I've done the first draft of the summary here:  https://wikitech.wikimedia.org/wiki/Dell_Enterprise_Sonic_Evaluation  Feel fre...
[13:03:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[13:03:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:08:20] <wikibugs>	 (03PS2) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[13:10:51] <wikibugs>	 (03PS3) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[13:12:04] <wikibugs>	 (03PS4) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[13:12:06] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense nice work!  If it were Python I'd suggest manipulating the addresses using the ipaddress library rather than string splitting " [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro)
[13:12:18] <wikibugs>	 (03PS5) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[13:15:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond)
[13:15:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (10Papaul) @cmooney thanks for putting this together.
[13:29:38] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:32:22] <wikibugs>	 (03PS6) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[13:33:52] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:42:19] <jinxer-wm>	 (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:45] <XioNoX>	 hotlink?
[13:42:49] <wikibugs>	 (03PS1) 10Btullis: Add thirdparty/bigtop15 component to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643)
[13:43:02] <jbond>	 looking
[13:44:10] <XioNoX>	 jbond: yep, looks like     https://upload.wikimedia.org/wikipedia/commons/d/db/Neha_Hinge_%2811%29.jpg 
[13:44:41] <jbond>	 XioNoX: ack thanks, do you know what action was taken previously? create a requestctl rule for this image?
[13:44:46] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36642/console" [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis)
[13:45:42] <XioNoX>	 and again it's with referrer https://click-it.me/
[13:45:53] <XioNoX>	 jbond: the spike is over it will self heal
[13:45:53] <cdanis>	 jbond: that's the best option yeah
[13:45:55] <XioNoX>	 heel
[13:45:57] <cdanis>	 or ... just wait
[13:45:59] <cdanis>	 heh
[13:46:19] <XioNoX>	 I was right, heal
[13:46:33] <jbond>	 ack thanks, I'll see if there is anyone in traffic to try and progress the hotlinking patch
[13:46:34] <XioNoX>	 we can have a strong rate limit for https://click-it.me/
[13:46:54] <XioNoX>	 jbond: does the hotlink patch only apply with an empty referer?
[13:47:02] <XioNoX>	 because now it's set to that url
[13:47:19] <jinxer-wm>	 (ProbeDown) resolved: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:47:23] <jbond>	 XioNoX: no, it applies a rate limit to anything that doesn't have a referer with a WMF domain
[13:47:30] <XioNoX>	 nice
[13:47:38] <XioNoX>	 so yeah that will help
[13:47:39] <cdanis>	 there is some need to handle 'allowed' referrers like with maps
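The rule described above (rate limit anything whose referer is not a WMF domain, with an allow-list for cases like maps) can be sketched in Python as follows. This is only an illustration of the decision logic, not the actual patch: the domain suffixes, the allow-listed host, and the function names are all assumptions, as is the choice to treat a missing referer as rate-limited.

```python
from urllib.parse import urlsplit

# Hypothetical first-party suffixes; the real list lives in the patch, not here.
FIRST_PARTY_SUFFIXES = ("wikipedia.org", "wikimedia.org")
# Hypothetical allow-list for third-party sites that may embed media.
ALLOWED_THIRD_PARTY_HOSTS = frozenset({"maps.example.org"})


def referer_host(referer):
    """Extract the lowercase hostname from a Referer header, or None."""
    if not referer:
        return None
    return urlsplit(referer).hostname


def should_rate_limit(referer) -> bool:
    """Return True when the request should go through the hotlink rate limit."""
    host = referer_host(referer)
    if host is None:
        # Assumption: a missing or unparseable referer is not first-party.
        return True
    if host in ALLOWED_THIRD_PARTY_HOSTS:
        return False
    # Match the domain itself or any subdomain, but not look-alike names.
    return not any(
        host == suffix or host.endswith("." + suffix)
        for suffix in FIRST_PARTY_SUFFIXES
    )
```

Matching on `"." + suffix` rather than a bare `endswith(suffix)` keeps a host such as `evilwikipedia.org` from passing as first-party.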
[13:47:59] <TheresNoTime>	 (just out of idle curiosity, how did you find out which file / figure out that it was a load of hotlink traffic?)
[13:48:05] <XioNoX>	 or temporarily have a strict rate limit for that specific one as we keep seeing it
[13:48:17] <jbond>	 cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723 if interested
[13:48:20] <XioNoX>	 TheresNoTime: experience :)
[13:48:26] <cdanis>	 XioNoX: it should be possible to add per-URL bytes egress limit using haproxy stick-tables
[13:48:28] <XioNoX>	 TheresNoTime: but confirmed with NEL data
[13:48:46] <XioNoX>	 TheresNoTime: dunno if you have access to https://logstash.wikimedia.org/app/dashboards#/view/ee6432c0-82a9-11eb-9d45-739221ba7fb6
[13:48:54] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:48:58] <cdanis>	 ahaha we had enough NEL errors for the image that it showed up there?  I was expecting you to say you looked at centrallog1001 weblog
[13:49:14] <XioNoX>	 cdanis: yeah :)
[13:49:18] <XioNoX>	 checking those
[13:49:21] <cdanis>	 another thing we could do: redirect to another site dynamically
[13:49:24] <XioNoX>	 might be the same with some latency
[13:49:25] <TheresNoTime>	 XioNoX: ah I do, I initially went to logstash but didn't think to look there :) guess that's where the experience comes in :D
[13:49:59] <cdanis>	 the two 'quick' places to look on logstash are varnish5xx and NEL
[13:51:21] <TheresNoTime>	 thank you :) 
[13:51:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone)
[13:51:41] <XioNoX>	 jbond: you can ack/ignore the librenms alert it should recover, I'm keeping an eye on it
[13:51:50] <jbond>	 ack thanks
[13:52:00] <cdanis>	 XioNoX: I'll add writing up a brief proposal about haproxy stick-tables to my list this week
[13:52:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) >>! In T314676#8135321, @Aklapper wrote: > Note that the Phabricator account @VirginiaPoundstone is linked to [a self-created, non-WMF SUL wik...
[13:52:21] <cdanis>	 varnish can't easily do throttling by bytes egress, but haproxy can
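To illustrate the idea of throttling by bytes egress per URL, here is a generic token-bucket sketch in Python. It is not haproxy configuration and does not claim to reflect how stick-tables work internally; the class name, parameters, and the rate and burst values used in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float   # bytes currently available to send
    updated: float  # timestamp of the last refill, in seconds


class PerUrlEgressLimiter:
    """Token bucket keyed by URL, counting response bytes instead of requests."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.buckets = {}

    def allow(self, url: str, size: int, now: float) -> bool:
        """Return True if `size` bytes may be served for `url` at time `now`."""
        bucket = self.buckets.get(url)
        if bucket is None:
            # A new URL starts with a full bucket.
            bucket = Bucket(self.burst, now)
            self.buckets[url] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - bucket.updated)
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if size <= bucket.tokens:
            bucket.tokens -= size
            return True
        return False
```

The caller passes `now` explicitly (for example from `time.monotonic()`), which keeps the limiter deterministic and easy to test. Counting bytes rather than requests is what makes a single large image behave differently from many small ones.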
[13:53:33] <XioNoX>	 cdanis: could be worth having a "NELs by url" visualisation on the NEL dashboard for those usecases too
[13:53:40] <cdanis>	 indeed
[13:53:49] <jbond>	 +1
[13:53:54] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:53:55] <cdanis>	 XioNoX: please feel free to edit the dashboard or file a task ;)
[13:54:10] <XioNoX>	 cdanis: sorry I can't hear you, you're too far away
[13:54:29] <XioNoX>	 cdanis: cool for the stick-tables, I'll need a tldr, the task is becoming huge :)
[13:54:54] <cdanis>	 ahahah
[13:55:46] <jbond>	 p
[13:55:53] <cdanis>	 okay I made some notes to myself, off to a meeting now
[13:55:57] <cdanis>	 thanks jbond XioNoX 
[13:56:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:59:10] <jbond>	 np and cheers cdanis, XioNoX :)
[14:00:09] <XioNoX>	 cdanis: I added it ;)
[14:01:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[14:01:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Ok let's see how it goes! If needed we'll prune some isvcs in the future :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira)
[14:06:18] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph:osd: add support for multi-network setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro)
[14:09:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add puppet profile and role files for WikiFunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[14:10:53] <wikibugs>	 (03CR) 10Ori: [C: 03+2] Add puppet profile and role files for WikiFunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[14:11:01] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36643/console" [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis)
[14:11:02] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:12:29] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) a:05Jdforrester-WMF→03None
[14:17:54] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:48] <wikibugs>	 (03PS1) 10Samtar: logos/manage.py: Use shortened link in user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246
[14:20:50] <wikibugs>	 (03PS1) 10Elukey: ml-services: update editquality's Docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821247 (https://phabricator.wikimedia.org/T301878)
[14:20:52] <wikibugs>	 (03PS1) 10Elukey: ml-services: test the new Docker image for articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/821248 (https://phabricator.wikimedia.org/T301878)
[14:22:31] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF)
[14:26:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Add thirdparty/bigtop15 component to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis)
[14:26:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update editquality's Docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821247 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:29:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: test the new Docker image for articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/821248 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[14:33:22] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[14:33:22] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[14:34:27] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10sbassett) 05In progress→03Resolved
[14:34:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:41:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove CSV dump scripts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[14:41:54] <wikibugs>	 (03CR) 10Ahmon Dancy: wmflib: fix ipresolve AAAA string representation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[14:42:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Bump pynetbox to ~= 6.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[14:42:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Bump pynetbox to ~= 6.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/820808 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[14:45:25] <wikibugs>	 (03PS7) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776)
[14:45:36] <wikibugs>	 (03CR) 10Jbond: wmflib: fix ipresolve AAAA string representation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[14:45:50] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime)
[14:46:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
[14:46:47] <wikibugs>	 (03PS2) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[14:46:47] <stashbot>	 T314256: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256
[14:46:49] <wikibugs>	 (03PS1) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254
[14:47:00] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
[14:49:49] <wikibugs>	 (03PS2) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254
[14:50:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::blackbox::http: add/edit parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn)
[14:51:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[14:52:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond)
[14:53:04] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:53:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255
[14:55:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:55:47] <wikibugs>	 (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[14:56:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond)
[14:56:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 (owner: 10Giuseppe Lavagetto)
[14:56:58] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Looks reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche)
[14:58:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, (non blocking, hence the +1) please note that that address *might* receive alerts from non-production alertmanager deployments " [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[14:59:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:01:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:02:12] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime)
[15:02:24] <wikibugs>	 (03PS1) 10Jbond: ganeti-netbox-sync: just use the default CA buyndle [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257
[15:03:40] <wikibugs>	 (03CR) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:04:19] <wikibugs>	 (03PS3) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[15:06:10] <wikibugs>	 (03PS4) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[15:06:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:06:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:08:18] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (10Cmjohnson) Case opened, You have successfully submitted request SR148431542.
[15:09:16] <wikibugs>	 (03PS3) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254
[15:09:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Manually tested and works fine." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond)
[15:09:56] <wikibugs>	 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10herron) >>! In T313603#8126164, @CDanis wrote:...
[15:10:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Cmjohnson)
[15:10:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "But we need to deploy I482631ebf972e755cd9ef1f11175854c0581bcae first if not already done." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond)
[15:10:46] <wikibugs>	 (03PS5) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[15:10:57] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Cmjohnson) 05Open→03Resolved swapped cable
[15:11:14] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10MusikAnimal) It's worth mentioning that, like [[ https://www.mediawiki.o...
[15:12:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[15:12:25] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond)
[15:14:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove CSV dump scripts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi)
[15:15:30] <wikibugs>	 (03CR) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:16:15] <wikibugs>	 (03PS2) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457)
[15:17:39] <wikibugs>	 (03PS1) 10Andrew Bogott: trove-guestagent.conf: standardize rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/821261 (https://phabricator.wikimedia.org/T314522)
[15:19:31] <wikibugs>	 (03PS1) 10Elukey: ml-services: add environment variables to editquality pods/isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/821263 (https://phabricator.wikimedia.org/T301878)
[15:20:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:35] <wikibugs>	 (03PS4) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254
[15:21:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36644/console" [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:22:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/pcc-worker1001/36644/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:22:54] <wikibugs>	 (03CR) 10Ori: [C: 03+2] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[15:25:14] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:21] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond)
[15:26:13] <wikibugs>	 (03CR) 10Ahmon Dancy: P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond)
[15:27:31] <wikibugs>	 (03PS5) 10Jbond: netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452)
[15:27:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond)
[15:27:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "thanks" [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond)
[15:28:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:30:05] <jouncebot>	 jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1530).
[15:30:22] <icinga-wm>	 PROBLEM - Host es2021 is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:23] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10BBlack) The existing google IP apparently doesn't even have TLS (just old port 80), so it defaults to an insecure site warning in Chrome.  Google's public reso...
[15:32:02] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:32:18] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es4 on es2020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:32:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
[15:33:46] <icinga-wm>	 RECOVERY - Host es2021 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms
[15:35:11] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36645/console" [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[15:35:43] <icinga-wm>	 PROBLEM - MariaDB read only es4 #page on es2021 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[15:36:03] <Amir1>	 I'm around
[15:36:08] <Amir1>	 gonna downtime it
[15:36:16] <bblack>	 acked
[15:36:22] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: es4 on es2021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:36:34] <bblack>	 planned?
[15:36:46] <jhathaway>	 around as well
[15:36:56] <Amir1>	 to my knowledge
[15:37:20] <icinga-wm>	 PROBLEM - mysqld processes on es2021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:37:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
[15:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
[15:37:58] <bblack>	 ok, will resolve
[15:39:39] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Bump pynetbox to ~= 6.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/820808 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[15:41:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es4 on es2020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:45:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
[15:46:10] <sukhe>	 !log upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
[15:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:23] <jynus>	 what was the issue?
[15:47:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
[15:49:22] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul)
[15:49:30] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:49:33] <rzl>	 I didn't get paged through VO, did anyone?
[15:49:54] <jynus>	 you shouldn't have, not if everything works as expected
[15:50:03] <rzl>	 oh right it's working hours :) thanks
[15:50:11] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul) 05Open→03Resolved complete
[15:51:18] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (10Papaul) 05Open→03Resolved This is complete
[15:53:57] <jbond>	 however i didn't get paged either
[15:54:00] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es4 on es2022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1516.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:54:38] <jynus>	 jbond: maybe your schedule had finished already?
[15:54:44] <jynus>	 let me see
[15:54:51] <jbond>	 it shouldn't finish for another 6 minutes
[15:54:53] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/820777 (owner: 10Faidon Liambotis)
[15:54:57] <jbond>	 thanks jynus 
[15:55:26] <Amir1>	 I got the page FWIW
[15:55:44] <icinga-wm>	 RECOVERY - IPMI Sensor Status on es2021 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:55:51] <jbond>	 could be from the changes this morning, I may have got put on an earlier shift
[15:55:58] <jynus>	 I see it already finished
[15:56:18] <jynus>	 note it is on Cathal's name
[15:57:32] <jynus>	 I wonder if you edited/checked your batphone schedule, not the emea pool 2
[15:58:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[15:58:05] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[15:58:08] <jbond>	 jynus: i have not edited anything
[15:58:17] <jbond>	 the ui confuses me too much :)
[15:58:27] <jbond>	 i do see here though https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/on-call-schedule that I'm no longer mentioned
[15:58:30] <jbond>	 for today
[15:58:38] <jynus>	 jbond: when you start your week you are supposed to edit it to adjust to your preferred schedule
[15:58:45] <jynus>	 as per manual
[15:59:02] <jbond>	 jynus: cdanis: has added a bot that should do that automatically
[15:59:08] <jynus>	 ah
[15:59:12] <jbond>	 cdanis: please correct me if I'm wrong
[15:59:18] <jynus>	 that part I didn't know
[15:59:27] <jbond>	 i think it got added last week at some point
[15:59:34] <wikibugs>	 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Michael)
[15:59:34] <jynus>	 but 
[15:59:43] <cdanis>	 jbond: no, that's for not needing to edit "immediate" vs "5 minutes" when the business hours rotation is in effect
[15:59:46] <cdanis>	 for escalation to batphone
[15:59:49] <jynus>	 I can confirm your schedule finished already
[16:00:01] <cdanis>	 at the start of the week, the oncallers are still supposed to edit the business hours rotation to your preferred hours
[16:00:17] <wikibugs>	 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael)
[16:00:17] <jynus>	 sorry it is confusing, you are not alone :-)
[16:00:20] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
[16:00:24] <wikibugs>	 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael)
[16:00:31] <cdanis>	 perhaps I'll go over it briefly in the meeting
[16:00:36] <jbond>	 cdanis: ahh yes that bit i did with leo (or should i say leo did for me) 
[16:00:47] <jbond>	 however i think there was some issue that jynus fixed for me this morning
[16:01:04] <jynus>	 hopefully it will get easier with time + improvements
[16:01:09] <cdanis>	 ah
[16:01:10] * jbond hopes
[16:01:23] <jynus>	 but I just touched the override for cathal, not the actual schedule
[16:01:35] <jynus>	 it could have reset it, though
[16:02:38] <jynus>	 in any case, please adjust the hours to the ones you prefer now :-D
[16:02:50] <jbond>	 jynus: yes will do 
[16:03:17] <jynus>	 cdanis: there was an issue with the automation, I think, not sure if you saw scrollback
[16:03:25] <jynus>	 during the weekend
[16:03:40] <cdanis>	 jynus: victorops was erroneously configured, was the issue
[16:03:47] <jynus>	 yeah
[16:03:48] <cdanis>	 (this weekend)
[16:03:53] <cdanis>	 not an issue with the automation ;)
[16:03:53] <jynus>	 don't know the details, sorry
[16:04:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
[16:04:19] <jynus>	 yeah, sorry I didn't mean automation, more like something in the procedure or something
[16:04:38] <jynus>	 I don't know the details, joe was more involved on that part
[16:05:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add environment variables to editquality pods/isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/821263 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[16:09:24] <wikibugs>	 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10ayounsi) FYI, I made this dashboard a while ago: https://logstash.wikimedia.org/app/dashboards#/view/AWm67Kpk8aQffZ3HmRpW hopefully it ca...
[16:09:41] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[16:09:45] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:10:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:10:41] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10Papaul) 05Open→03Resolved This is complete
[16:10:47] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul)
[16:11:08] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: es4 on es2021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:12:58] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:14:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:14:35] <icinga-wm>	 RECOVERY - MariaDB read only es4 #page on es2021 is OK: Version 10.4.25-MariaDB-log, Uptime 393s, read_only: True, event_scheduler: True, 29.45 QPS, connection latency: 0.003711s, query latency: 0.000540s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:16:17] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314607 (10Papaul) 05Open→03Declined There is already a task for this @ T314509
[16:16:18] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:29] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[16:16:32] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:16:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
[16:16:50] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es4 on es2020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:00] <wikibugs>	 (03CR) 10Tacsipacsi: "Why is this far away from other Translate settings ($wmgUseTranslate, $wmgTranslateWorkflowStates etc.)?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro)
[16:18:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:19:22] <icinga-wm>	 RECOVERY - mysqld processes on es2021 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[16:19:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
[16:20:05] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul)
[16:20:26] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) 05Open→03Resolved complete
[16:24:58] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
[16:25:20] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul)
[16:25:59] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) 05Open→03Resolved complete
[16:26:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
[16:26:10] <wikibugs>	 (03PS1) 10Hnowlan: Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196)
[16:27:55] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es4 on es2020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:30:42] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:32:08] <wikibugs>	 (03PS2) 10Hnowlan: Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196)
[16:33:31] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) @fgiunchedi can you please take a look at this alert i see only Smart Storage Battery failed and no disk failed.   Thanks
[16:33:45] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) @fgiunchedi can you please take a look at this alert i see only Smart Storage Battery failed and no disk failed.   Thanks
[16:33:51] <wikibugs>	 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10Legoktm) I don't fully understand how FSFileBackend will work here, as t...
[16:36:00] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul)
[16:36:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ganeti-netbox-sync: just use the default CA buyndle [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond)
[16:36:43] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) 05Open→03Resolved complete
[16:38:01] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[16:38:01] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking)
[16:38:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
[16:38:52] <icinga-wm>	 PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:40:16] <icinga-wm>	 RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:22] <wikibugs>	 (03CR) 10David Caro: "Hmm, by the logs of the failure it's the flake8 test not prospector." [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond)
[16:41:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:41:39] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
[16:42:53] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Papaul)
[16:43:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:46:21] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Papaul) 05Open→03Resolved complete
[16:49:49] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[16:49:54] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:51:37] <wikibugs>	 (03PS1) 10Btullis: Fix the spark3 profile [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072)
[16:52:17] <wikibugs>	 10SRE, 10DynamicPageList (Wikimedia), 10serviceops, 10Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Krinkle)
[16:52:59] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36646/console" [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis)
[16:53:47] <wikibugs>	 (03PS7) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380
[16:53:49] <wikibugs>	 (03PS9) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730
[16:54:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
[16:54:48] <wikibugs>	 (03PS8) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380
[16:54:57] <wikibugs>	 (03CR) 10Ayounsi: sre.network.debug: initial commit (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi)
[16:55:18] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the spark3 profile [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis)
[16:56:18] <wikibugs>	 (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279
[16:57:56] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279 (owner: 10Ahmon Dancy)
[16:58:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi)
[16:59:03] <wikibugs>	 (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279 (owner: 10Ahmon Dancy)
[16:59:48] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul)
[17:00:02] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul) 05Open→03Resolved complete
[17:00:05] <jouncebot>	 ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1700).
[17:00:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
[17:01:54] <icinga-wm>	 PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into  moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) @LSobanski hello any update on this?  Thanks
[17:03:39] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Reedy)
[17:04:22] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es4 on es2022 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:09:58] <icinga-wm>	 RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
[17:15:46] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
[17:34:12] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
[17:37:39] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10KFrancis) @Dzahn I am confirming the NDA has been signed.  Please proceed with the access request.  Thanks!
[17:58:36] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:59:23] <wikibugs>	 (03PS1) 10CDanis: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603)
[17:59:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:01:21] <wikibugs>	 (03PS2) 10CDanis: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603)
[18:20:01] <wikibugs>	 (03PS1) 10Jdlrobson: Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756)
[18:23:57] <wikibugs>	 (03PS2) 10Clare Ming: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296)
[18:40:23] <wikibugs>	 (03PS1) 10Ottomata: Don't hardcode /opt/conda-analytics in spark3.env.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/821293 (https://phabricator.wikimedia.org/T312882)
[18:41:28] <wikibugs>	 (03PS3) 10CDanis: haproxy: properly track client concurrency, & more [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580)
[18:43:04] <wikibugs>	 (03PS3) 10Ori: abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457)
[18:43:13] <wikibugs>	 (03CR) 10Mary Yang: [C: 03+1] abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[18:45:31] <wikibugs>	 (03PS4) 10CDanis: haproxy: properly track client concurrency, & more [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580)
[18:45:44] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[18:46:42] <wikibugs>	 (03CR) 10Ori: [C: 03+2] abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori)
[18:51:16] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:52:22] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "Valentin: I'm going to merge this now so we can start gathering correct stats as quickly as possible, but I'm very happy to take comments " [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[18:56:07] <wikibugs>	 (03PS1) 10Ottomata: Fix sudo rules for airflow platform eng admins [puppet] - 10https://gerrit.wikimedia.org/r/821296 (https://phabricator.wikimedia.org/T313727)
[18:58:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] trove-guestagent.conf: standardize rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/821261 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott)
[19:01:50] <wikibugs>	 (03Abandoned) 10Andrew Bogott: nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/820758 (owner: 10Andrew Bogott)
[19:03:55] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix sudo rules for airflow platform eng admins [puppet] - 10https://gerrit.wikimedia.org/r/821296 (https://phabricator.wikimedia.org/T313727) (owner: 10Ottomata)
[19:12:43] <wikibugs>	 (03PS1) 10Andrew Bogott: nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/821297
[19:12:45] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack::cinder: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268)
[19:14:38] <wikibugs>	 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (10Krinkle) p:05Triage→03Low a:03tstarling
[19:14:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/821297 (owner: 10Andrew Bogott)
[19:15:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:10] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:25:58] <wikibugs>	 (03PS1) 10CDanis: haproxy: fix excess_concurrency/would_drop debug logging [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580)
[19:26:46] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "same proviso as previous patch ツ" [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[19:28:13] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "PCC LGTM https://puppet-compiler.wmflabs.org/pcc-worker1001/36648/cp2027.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[19:28:44] <cdanis>	 andrewbogott: puppet-merging your patch as well
[19:28:53] <andrewbogott>	 thanks!
[19:34:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic)
[19:34:08] <wikibugs>	 (03PS1) 10CDanis: haproxy: bump concurrency threshold [puppet] - 10https://gerrit.wikimedia.org/r/821301 (https://phabricator.wikimedia.org/T306580)
[19:34:48] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] haproxy: bump concurrency threshold [puppet] - 10https://gerrit.wikimedia.org/r/821301 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[19:43:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) Thanks for chiming in @ayounsi. Now that things are not urgently broken I have time to engage with your questions :)  > Given the...
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T2000).
[20:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:06] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10netbox, 10PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (10ayounsi) 05Open→03Resolved This seems to be fixed based on puppetboard.
[20:02:08] <cjming>	 i am the only one on the list so i will deploy
[20:04:20] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming)
[20:04:44] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756) (owner: 10Jdlrobson)
[20:05:09] <wikibugs>	 (03Merged) 10jenkins-bot: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming)
[20:07:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic from reading their docs, I think they only support sending from a subdomain of wikimedia.org:  > Attention: Adding a custom FROM d...
[20:08:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:11:24] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785|Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
[20:11:27] <stashbot>	 T312296: Disable sticky header edit button A/B test for pilot wikis - https://phabricator.wikimedia.org/T312296
[20:11:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) a:03jhathaway
[20:12:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:20:49] <wikibugs>	 (03Merged) 10jenkins-bot: Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756) (owner: 10Jdlrobson)
[20:27:34] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243|Fix grid blowout bug (T314756)]] (duration: 03m 26s)
[20:27:37] <stashbot>	 T314756: Grid blowout on various pages with long elements - https://phabricator.wikimedia.org/T314756
[20:27:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:28:03] <cjming>	 !log end of UTC late backport window
[20:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[20:29:24] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[20:31:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:32:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:36:12] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
[20:36:20] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1062.eqiad.wmnet with OS bullseye
[20:41:36] <wikibugs>	 (03PS1) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295)
[20:50:10] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack::nova and ::neutron: use service names for rabbit nodes [puppet] - 10https://gerrit.wikimedia.org/r/821311 (https://phabricator.wikimedia.org/T314522)
[20:50:50] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
[21:33:44] <icinga-wm_>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:34:36] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) a:05Joe→03BCornwall
[21:34:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall)
[21:36:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) p:05Triage→03Medium
[21:36:13] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) @akosiaris Can you sign off with your approval that this user is indeed the one to grant access?
[21:36:25] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
[21:36:32] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye
[21:38:20] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (BCornwall) p:Triage→Medium
[21:43:58] <wikibugs>	 SRE, Acme-chief, Patch-For-Review: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (BCornwall) Open→Resolved a:BCornwall @Dzahn I'm assuming you meant 0.3, which has long since been deployed. I...
[21:45:03] <wikibugs>	 (PS1) Stang: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/821330
[21:45:04] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (BCornwall) p:Triage→Medium
[21:45:15] <wikibugs>	 (PS2) Stang: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - https://gerrit.wikimedia.org/r/821330
[21:50:25] <wikibugs>	 (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil)
[21:50:34] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
[21:53:27] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
[21:59:26] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (TAndic) @jhathaway Good catch! I've set up one for **surveys.wikimedia.org** and updated the configuration in the sheet I shared, starting on line 11...
[21:59:31] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:04:42] <wikibugs>	 (PS2) Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295)
[22:10:47] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway) >>! In T314815#8139092, @TAndic wrote: > @jhathaway Good catch! I've set up one for **surveys.wikimedia.org** and updated the configuratio...
[22:16:39] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
[22:16:45] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye completed: - elastic1062 (...
[22:16:47] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[22:16:49] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye executed with errors: - el...
[22:16:50] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[22:18:22] <wikibugs>	 (PS1) Clare Ming: Enable sticky header edit test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/821319 (https://phabricator.wikimedia.org/T312573)
[22:38:26] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (TAndic) For the difference, I'm specifically looking at Step 13 of https://www.qualtrics.com/support/survey-platform/distributions-module/email-distr...
[22:44:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:49:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:07] <wikibugs>	 SRE, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) What is the relation of this task to `T214201`? Does this one block the other one (=subtask)?
[23:06:37] <wikibugs>	 (Abandoned) BryanDavis: rabbitmq: Fix SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816001 (https://phabricator.wikimedia.org/T308013) (owner: BryanDavis)
[23:28:58] <wikibugs>	 (PS2) Tim Starling: Remove abandoned Echo job queue test [mediawiki-config] - https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750)
[23:29:00] <wikibugs>	 (PS2) Tim Starling: Remove testwiki example.org link [mediawiki-config] - https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750)
[23:29:02] <wikibugs>	 (PS2) Tim Starling: Remove wgVectorResponsive [mediawiki-config] - https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750)
[23:29:04] <wikibugs>	 (PS2) Tim Starling: Remove override for wgRevisionCacheExpiry [mediawiki-config] - https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750)
[23:29:06] <wikibugs>	 (PS2) Tim Starling: Remove testwiki wgTorTagChanges override [mediawiki-config] - https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750)
[23:29:08] <wikibugs>	 (PS2) Tim Starling: Remove testwiki live preview demo [mediawiki-config] - https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750)
[23:29:10] <wikibugs>	 (PS2) Tim Starling: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750)
[23:29:12] <wikibugs>	 (PS2) Tim Starling: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750)
[23:29:14] <wikibugs>	 (PS2) Tim Starling: Remove wmgUseWikimediaShopLink [mediawiki-config] - https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365)
[23:31:41] <wikibugs>	 (PS1) Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139)
[23:31:45] <wikibugs>	 (PS1) Cwhite: logstash: reduce webrequest retention to 31 days [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T314139)
[23:32:42] <wikibugs>	 (PS2) Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139)
[23:33:01] <wikibugs>	 (PS3) Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139)
[23:33:44] <icinga-wm_>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:36:00] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove wmgUseWikimediaShopLink [mediawiki-config] - https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: Tim Starling)
[23:36:05] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove wmgDisplayFeedsInSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:36:14] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove unnecessary override for wmgUseCLDR [mediawiki-config] - https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:36:23] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove testwiki live preview demo [mediawiki-config] - https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:36:33] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove testwiki wgTorTagChanges override [mediawiki-config] - https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:36:42] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove override for wgRevisionCacheExpiry [mediawiki-config] - https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:36:53] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove wgVectorResponsive [mediawiki-config] - https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:37:07] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove testwiki example.org link [mediawiki-config] - https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:37:19] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove abandoned Echo job queue test [mediawiki-config] - https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:23] <wikibugs>	 (Merged) jenkins-bot: Remove abandoned Echo job queue test [mediawiki-config] - https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:26] <wikibugs>	 (Merged) jenkins-bot: Remove testwiki example.org link [mediawiki-config] - https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:29] <wikibugs>	 (Merged) jenkins-bot: Remove wgVectorResponsive [mediawiki-config] - https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:33] <wikibugs>	 (Merged) jenkins-bot: Remove override for wgRevisionCacheExpiry [mediawiki-config] - https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:42] <wikibugs>	 (Merged) jenkins-bot: Remove testwiki wgTorTagChanges override [mediawiki-config] - https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:44] <wikibugs>	 (Merged) jenkins-bot: Remove testwiki live preview demo [mediawiki-config] - https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:38:47] <wikibugs>	 (Merged) jenkins-bot: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:39:15] <wikibugs>	 (Merged) jenkins-bot: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: Tim Starling)
[23:39:23] <wikibugs>	 (Merged) jenkins-bot: Remove wmgUseWikimediaShopLink [mediawiki-config] - https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: Tim Starling)
[23:45:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:46:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:46:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:46:49] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments T314750 (duration: 03m 27s)
[23:46:53] <stashbot>	 T314750: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750
[23:47:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:52:20] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments T314750 (duration: 03m 19s)
[23:52:24] <stashbot>	 T314750: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750
[23:53:06] <wikibugs>	 SRE, Performance-Team, serviceops, Patch-For-Review: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (tstarling) Open→Resolved
[23:53:14] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[23:59:48] <wikibugs>	 (PS2) Andrew Bogott: openstack::nova: use TLS on rabbitmq connections [puppet] - https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268)