[00:04:17] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:19:21] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:25:13] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:29:53] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:59] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:06] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948687 (owner: TrainBranchBot)
[00:31:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:38:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948694
[00:38:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948694 (owner: TrainBranchBot)
[00:57:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/948694 (owner: TrainBranchBot)
[00:57:29] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:37] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:09:42] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: Hamish)
[01:17:31] <wikibugs>	 (CR) Anzx: add 'autopatrol' to Wikifunctions' functioneer group (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: Mdaniels5757)
[01:21:22] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host
[01:21:33] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host (duration: 00m 10s)
[01:22:03] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host
[01:22:14] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host (duration: 00m 10s)
[01:26:33] <wikibugs>	 (PS1) Ryan Kemper: wdqs.data-transfer: fix usage [cookbooks] - https://gerrit.wikimedia.org/r/949146
[01:29:33] <taavi>	 jouncebot: nowandnext
[01:29:33] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 30 minute(s)
[01:29:33] <jouncebot>	 In 4 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600)
[01:30:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:30:30] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:31:43] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:35:03] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[01:36:23] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:36:46] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:37:14] <wikibugs>	 (PS1) Majavah: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031)
[01:37:52] <wikibugs>	 (PS1) Majavah: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031)
[01:39:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:39:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:41:39] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:41:49] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:45:14] <wikibugs>	 (Merged) jenkins-bot: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:45:17] <wikibugs>	 (Merged) jenkins-bot: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:46:01] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]]
[01:46:05] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[01:47:39] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[01:50:29] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[01:51:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:54:16] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:56:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:57:01] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]] (duration: 10m 59s)
[01:57:01] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:04] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[02:01:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:11:38] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:31] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:16:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:20:29] <taavi>	 !log create oathauth_devices and oathauth_types tables on wikitech, private.dblist, fishbowl.dblist, centralauth T242031
[02:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:33] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[02:31:38] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:43] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:33] <wikibugs>	 (PS1) Majavah: Add READ_NEW | WRITE_NEW for OAuth multiple devices to techconductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031)
[02:49:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:53:47] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:56:51] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:56:53] <wikibugs>	 (PS2) Mdaniels5757: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085)
[03:14:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:13] <wikibugs>	 (PS2) Majavah: Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031)
[03:18:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:01] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:44] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:22:45] <wikibugs>	 (PS1) Majavah: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031)
[03:23:05] <wikibugs>	 (PS1) Majavah: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031)
[03:23:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:23:48] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:24:00] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[03:26:01] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:27:11] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:27:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:54] <wikibugs>	 (Merged) jenkins-bot: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:28:56] <wikibugs>	 (Merged) jenkins-bot: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:29:21] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:27] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]]
[03:29:32] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[03:30:51] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:31:05] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[03:33:55] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[03:34:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:37] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:26] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]] (duration: 10m 58s)
[03:40:30] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[03:43:09] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:43:49] <wikibugs>	 (Merged) jenkins-bot: Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[03:44:25] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]]
[03:45:58] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[03:46:02] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[03:47:10] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[03:48:41] <wikibugs>	 (PS1) Zabe: Initial configuration for suwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539)
[03:51:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:53:32] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]] (duration: 09m 07s)
[03:53:35] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[03:55:28] <wikibugs>	 (CR) Zabe: [C: +2] Initial configuration for suwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539) (owner: Zabe)
[03:55:41] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:49] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:08] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for suwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539) (owner: Zabe)
[03:56:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:58:11] <zabe>	 !log create Wikisource Sundanese # T343539
[03:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:14] <stashbot>	 T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539
[03:58:51] <logmsgbot>	 !log zabe@deploy1002 Started scap: T343539
[04:00:41] <logmsgbot>	 !log zabe@deploy1002 zabe: T343539 synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:01:14] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[04:07:40] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T343539 (duration: 08m 49s)
[04:07:46] <stashbot>	 T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539
[04:10:44] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/948705
[04:10:46] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/948705 (owner: Zabe)
[04:11:01] <logmsgbot>	 !log zabe@deploy1002 Started scap: update interwiki cache
[04:11:23] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/948705 (owner: Zabe)
[04:19:18] <logmsgbot>	 !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 08m 17s)
[04:21:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:45] <wikibugs>	 (PS1) Zabe: Initial configuration for Wiktionary Pa'O [mediawiki-config] - https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540)
[04:24:24] <wikibugs>	 (CR) Zabe: [C: +2] Initial configuration for Wiktionary Pa'O [mediawiki-config] - https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540) (owner: Zabe)
[04:25:04] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for Wiktionary Pa'O [mediawiki-config] - https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540) (owner: Zabe)
[04:25:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:19] <logmsgbot>	 !log zabe@deploy1002 Started scap: T343540
[04:28:23] <stashbot>	 T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540
[04:29:17] <zabe>	 !log create Wiktionary Pa'O # T343540
[04:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:54] <logmsgbot>	 !log zabe@deploy1002 zabe: T343540 synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:30:15] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:12] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[04:31:31] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:37:34] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T343540 (duration: 09m 15s)
[04:37:38] <stashbot>	 T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540
[04:38:21] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:39:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2330 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:54] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/949207
[04:39:56] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/949207 (owner: Zabe)
[04:40:36] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/949207 (owner: Zabe)
[04:41:20] <logmsgbot>	 !log zabe@deploy1002 Started scap: update interwiki cache
[04:44:58] <wikibugs>	 (CR) Vipz: [C: +1] Add "editautopatrolprotected", "patrol", "rollback" protection levels on sh.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (owner: Acamicamacaraca)
[04:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:49:26] <wikibugs>	 (CR) Jforrester: "This doesn't need to go through Engineering but Product. It doesn't seem on-wiki that this has consensus yet, or a model of what rights wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: Mdaniels5757)
[04:49:33] <logmsgbot>	 !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 08m 12s)
[04:50:17] <wikibugs>	 (PS3) Acamicamacaraca: Add "editautopatrolprotected", "patrol", "rollback" protection levels on sh.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171
[04:51:41] <wikibugs>	 (PS4) Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[04:53:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:55:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:56:21] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 210, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:12:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:15:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:56] <wikibugs>	 (PS5) Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306)
[05:19:35] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:31:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:39:08] <jinxer-wm>	 (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:42:45] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:44:08] <jinxer-wm>	 (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:56:41] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:59:17] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600)
[06:01:41] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:05:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3006.esams.wmnet with OS bullseye
[06:07:17] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db2190 [puppet] - https://gerrit.wikimedia.org/r/949400
[06:08:38] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db2190 [puppet] - https://gerrit.wikimedia.org/r/949400 (owner: Marostegui)
[06:15:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:57] <wikibugs>	 (PS2) Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935)
[06:24:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage
[06:25:29] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage
[06:28:09] <taavi>	 jouncebot: nowandnext
[06:28:09] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600)
[06:28:09] <jouncebot>	 In 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700)
[06:28:28] <taavi>	 is anyone using the mw infra window? can I run some maintenance scripts?
[06:33:10] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:34:10] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:36:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:15] <James_F>	 jouncebot: nowandnext
[06:36:15] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600)
[06:36:15] <jouncebot>	 In 0 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700)
[06:36:21] <James_F>	 OK, deploying a fun patch.
[06:37:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/948080 (owner: Giuseppe Lavagetto)
[06:38:56] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: add / to the route for wikifunctions [mediawiki-config] - https://gerrit.wikimedia.org/r/948080 (owner: Giuseppe Lavagetto)
[06:39:18] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:29] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]]
[06:41:05] <logmsgbot>	 !log jforrester@deploy1002 oblivian and jforrester: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[06:41:25] <logmsgbot>	 !log jforrester@deploy1002 oblivian and jforrester: Continuing with sync
[06:46:25] <wikibugs>	 (CR) Stevemunene: "recheck" [puppet] - https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[06:46:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[06:47:04] <wikibugs>	 (PS2) Jforrester: Enable url shortener in sidebar in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) (owner: Ladsgroup)
[06:47:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:47:53] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]] (duration: 08m 24s)
[06:52:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:56:22] <wikibugs>	 (PS1) Zabe: Some initial configurations for blkwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310)
[06:59:25] <zabe>	 jouncebot: nowandnext
[06:59:25] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600)
[06:59:25] <jouncebot>	 In 0 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700)
[07:00:02] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700). Please do the needful.
[07:00:04] <jouncebot>	 Bas_dehaan, Daniuu, and Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] <Aca>	 Confirming that I'm present :)
[07:00:27] <Daniuu>	 Present as well
[07:00:40] <Bas_dehaan>	 Also present 
[07:00:51] * taavi looks
[07:02:22] <Dreamy_Jazz>	 \o
[07:03:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[07:03:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3006.esams.wmnet with OS bullseye
[07:03:26] <taavi>	 Bas_dehaan: Daniuu: your patch needs a manual rebase
[07:03:45] <Daniuu>	 Bas_dehaan: can you take care of that?
[07:03:55] <Bas_dehaan>	 I’ll take a look
[07:03:57] * urbanecm is currently calling-in to Wikimania, not positioned to deploy.
[07:04:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3008.esams.wmnet with OS bullseye
[07:04:15] <taavi>	 urbanecm: are you here on-site?
[07:04:54] <urbanecm>	 taavi: nope, didn't pass staff selection. calling in from home.
[07:04:57] <wikibugs>	 (CR) Majavah: [C: -2] "This needs an on-wiki discussion, and a much stronger justification, due to the relatively small size of those groups (and the large overl" [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[07:05:06] <taavi>	 ah :(
[07:06:03] <wikibugs>	 (PS1) Dreamy Jazz: clienthints: Collect client hints on group1 wikis except two wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110)
[07:08:12] <wikibugs>	 (CR) Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[07:08:17] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110) (owner: Dreamy Jazz)
[07:08:35] <taavi>	 starting with Dreamy_Jazz's one
[07:08:49] <Dreamy_Jazz>	 Thanks!
[07:09:07] <wikibugs>	 (Merged) jenkins-bot: clienthints: Collect client hints on group1 wikis except two wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110) (owner: Dreamy Jazz)
[07:09:10] <zabe>	 taavi: :(
[07:09:35] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]]
[07:09:40] <stashbot>	 T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110
[07:11:21] <logmsgbot>	 !log taavi@deploy1002 taavi and dreamyjazz: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:14:33] <logmsgbot>	 !log taavi@deploy1002 taavi and dreamyjazz: Continuing with sync
[07:18:12] <taavi>	 Daniuu: Bas_dehaan: any news on the rebase?
[07:18:24] <Bas_dehaan>	 Working on it :)
[07:19:03] <Bas_dehaan>	 IDE had an update overnight and my git config got lost :(
[07:19:07] <Bas_dehaan>	 But fixing it
[07:21:13] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]] (duration: 11m 38s)
[07:21:17] <stashbot>	 T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110
[07:21:30] <taavi>	 zabe: are you around for deploying your patch?
[07:21:34] <zabe>	 o/
[07:22:25] <wikibugs>	 (PS2) Majavah: Some initial configurations for blkwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: Zabe)
[07:22:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: Zabe)
[07:23:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage
[07:23:39] <wikibugs>	 (Merged) jenkins-bot: Some initial configurations for blkwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: Zabe)
[07:24:06] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]]
[07:24:09] <stashbot>	 T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310
[07:25:37] <wikibugs>	 (Abandoned) Stevemunene: Add datahub_staging cname [dns] - https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236) (owner: Stevemunene)
[07:25:44] <logmsgbot>	 !log taavi@deploy1002 taavi and zabe: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:26:18] <zabe>	 looks good
[07:26:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage
[07:26:49] <logmsgbot>	 !log taavi@deploy1002 taavi and zabe: Continuing with sync
[07:30:32] <wikibugs>	 SRE-OnFire, Incident Tooling, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (Joe)
[07:30:54] <wikibugs>	 (PS6) Bas dehaan: Added extended confirmed on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642)
[07:31:22] <Bas_dehaan>	 Just rebased
[07:31:55] <Daniuu>	 courtesy ping taavi :) 
[07:32:07] <Daniuu>	 Thanks, Bas
[07:32:19] <taavi>	 perfect, will deploy that once this one finishes
[07:33:24] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]] (duration: 09m 17s)
[07:33:30] <stashbot>	 T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310
[07:33:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: Bas dehaan)
[07:34:06] <zabe>	 taavi: Thanks :)
[07:34:21] <wikibugs>	 (Merged) jenkins-bot: Added extended confirmed on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: Bas dehaan)
[07:34:52] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]]
[07:34:56] <stashbot>	 T329642: Implementing extended confirmed on nlwiki - https://phabricator.wikimedia.org/T329642
[07:36:32] <logmsgbot>	 !log taavi@deploy1002 bmdehaan and taavi: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:36:55] <taavi>	 please test
[07:37:12] <Daniuu>	 taavi: doing
[07:38:14] <wikibugs>	 SRE-OnFire, Incident Tooling, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (Joe) p:Triage→Medium
[07:41:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:54] <taavi>	 Daniuu: how is it going?
[07:45:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[07:46:23] <Bas_dehaan>	 Just tested on my nlwiki notepad; works as intended
[07:46:28] <logmsgbot>	 !log taavi@deploy1002 bmdehaan and taavi: Continuing with sync
[07:48:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[07:48:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3008.esams.wmnet with OS bullseye
[07:52:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[07:53:07] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]] (duration: 18m 15s)
[07:53:11] <stashbot>	 T329642: Implementing extended confirmed on nlwiki - https://phabricator.wikimedia.org/T329642
[07:54:25] <Daniuu>	 Seems to work as expected on production environment
[08:03:53] <taavi>	 !log mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php techconductwiki # T242031
[08:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:57] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[08:07:05] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti3006.esams.wmnet
[08:11:39] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:14:24] <wikibugs>	 SRE-OnFire, Incident Tooling, Patch-For-Review, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/sre/vopsbot/-/merge_requests/11  Allow users to be part of multiple teams
[08:16:13] <wikibugs>	 (PS1) Muehlenhoff: confd: Explicitly require directory for systemd cleanup timer [puppet] - https://gerrit.wikimedia.org/r/949496
[08:16:39] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:26:13] <wikibugs>	 (PS5) Jelto: gitlab: remove cas support [puppet] - https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390)
[08:27:18] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:46] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42898/console" [puppet] - https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: Jelto)
[08:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:20] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: remove cas support [puppet] - https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: Jelto)
[08:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:38:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet
[08:40:39] <wikibugs>	 (CR) David Caro: [C: +2] toolforge: add deployer module with the secrets [puppet] - https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: David Caro)
[08:42:29] <wikibugs>	 (PS1) D3r1ck01: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223)
[08:47:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet
[08:48:34] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[08:48:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1099.eqiad.wmnet with OS bullseye
[08:50:42] <wikibugs>	 (CR) Vgutierrez: [C: -1] "we have several issues here, some related to the CR itself, some to acme-chief and some to our CI environment:" [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[08:52:03] <wikibugs>	 (CR) Jaime Nuche: "I'm not very clear about the details, but if we merge this change, won't T344238 still happen again eventually but with more stuck ssh con" [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
[08:53:25] <wikibugs>	 (PS1) Urbanecm: mw-debug-repl: Add --verbose [puppet] - https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323)
[08:53:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[08:55:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:55:27] <wikibugs>	 (PS2) Urbanecm: mw-debug-repl: Add --verbose [puppet] - https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323)
[08:55:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall, just don't change limits for mw-debug." [deployment-charts] - https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[08:56:22] <wikibugs>	 (PS2) Jforrester: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: D3r1ck01)
[08:56:54] <wikibugs>	 (CR) Jelto: "Yes" [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
[08:59:36] <wikibugs>	 (CR) Jaime Nuche: [C: +1] gerrit: raise maxConnectionsPerUser to 8 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
[09:00:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mw-debug-repl: Add --verbose [puppet] - https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323) (owner: Urbanecm)
[09:00:19] <wikibugs>	 (PS1) Muehlenhoff: Apply ganeti role to ganeti3006/3008 [puppet] - https://gerrit.wikimedia.org/r/949500
[09:04:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1099.eqiad.wmnet with reason: host reimage
[09:04:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Apply ganeti role to ganeti3006/3008 [puppet] - https://gerrit.wikimedia.org/r/949500 (owner: Muehlenhoff)
[09:07:29] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1099.eqiad.wmnet with reason: host reimage
[09:08:53] <wikibugs>	 ops-codfw, serviceops-radar, Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (jijiki) @Jhancock.wm thank you very much for the update!
[09:10:05] <wikibugs>	 SRE-tools, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (ayounsi)
[09:10:46] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (ayounsi)
[09:10:50] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (ayounsi)
[09:11:07] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (ayounsi)
[09:11:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[09:11:27] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (ayounsi)
[09:12:52] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm some comments inline" [puppet] - https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[09:14:36] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:15:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:17:44] <icinga-wm>	 RECOVERY - Host cr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.69 ms
[09:17:45] <icinga-wm>	 RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 79.49 ms
[09:18:00] <icinga-wm>	 RECOVERY - Check systemd state on mw2330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 58, down: 28, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:18:46] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:19:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:48] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:07] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] role::wmcs::monitoring: pass through the ensure option [puppet] - https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: David Caro)
[09:27:43] <wikibugs>	 SRE-tools, Spicerack: Junos module in Spicerack - https://phabricator.wikimedia.org/T344326 (ayounsi) p:Triage→Low
[09:29:09] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1099.eqiad.wmnet with OS bullseye
[09:31:27] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:32:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:47] <_joe_>	 jouncebot: now and next
[09:32:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[09:32:54] <_joe_>	 jouncebot: nowandnext
[09:32:54] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[09:32:54] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1000)
[09:33:09] <_joe_>	 ok, I'll go on and merge the change to fix wikifunctions caching
[09:34:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change reverse for IPs on cr2-esams that had old cr3-knams in dns names - cmooney@cumin1001"
[09:35:10] <James_F>	 _joe_: Already deployed, sorry!
[09:35:27] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change reverse for IPs on cr2-esams that had old cr3-knams in dns names - cmooney@cumin1001"
[09:35:27] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:35:40] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references
[09:35:57] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 00m 16s)
[09:37:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:22] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:26] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:41:10] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references
[09:41:27] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 00m 17s)
[09:42:16] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references
[09:42:25] <icinga-wm>	 PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:00] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:43:00] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:43:22] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 01m 06s)
[09:44:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3009 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_80: Servers cp3063.esams.wmnet, cp3055.esams.wmnet, cp3061.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:44:39] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:44:49] <icinga-wm>	 RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms
[09:45:02] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s)
[09:45:30] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Fabfur)
[09:45:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:46:04] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3076 is OK: HTTP OK: HTTP/1.1 200 OK - 431 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:46:04] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3079 is OK: HTTP OK: HTTP/1.1 200 OK - 431 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:49:18] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:50:37] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos)
[09:51:00] <wikibugs>	 (03PS17) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[09:51:42] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10Jelto)
[09:51:50] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos)
[09:52:29] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) 05Open→03Resolved Cleanup of puppet code is done and most cas references are removed.  I'm not sure how to move forward...
[09:52:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos)
[09:52:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos)
[09:54:12] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 04m 53s)
[09:54:31] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:54:54] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s)
[09:56:08] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:56:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: istio: clarify instructions to get the istio version [deployment-charts] - 10https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253)
[09:56:32] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s)
[09:56:53] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:57:17] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s)
[09:57:34] <logmsgbot>	 !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references
[09:57:58] <logmsgbot>	 !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s)
[09:58:20] <wikibugs>	 (03PS3) 10Jgiannelos: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613
[09:58:43] <wikibugs>	 (03Abandoned) 10Jgiannelos: Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1000)
[10:00:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:00:10] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:57] <wikibugs>	 (03PS1) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503
[10:01:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (owner: 10Gehel)
[10:02:09] <MdsShakil>	 Congratulations @zabe :)
[10:02:29] <James_F>	 +1, very well-deserved zabe!
[10:02:30] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179
[10:03:05] <wikibugs>	 (03PS2) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179
[10:03:25] <TheresNoTime>	 grats!
[10:06:38] <wikibugs>	 10SRE, 10Puppet-Core: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Clement_Goubert)
[10:08:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:09:09] <wikibugs>	 (03PS8) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056)
[10:09:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[10:10:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Clement_Goubert) Re-adding #SRE and #Infrastructure-Foundations since this is cross-SRE work under IF stewardship.
[10:10:31] <wikibugs>	 (03PS9) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056)
[10:11:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[10:11:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253)
[10:12:14] <wikibugs>	 (03CR) 10Jbond: puppetserver: Add support for defining additional mount points (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[10:14:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'm following this: https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[10:16:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[10:18:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Clement_Goubert)
[10:19:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, ok lets just go with this 😊" [puppet] - 10https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291) (owner: 10Bking)
[10:19:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:25] <wikibugs>	 (03PS4) 10Effie Mouzeli: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos)
[10:20:51] <wikibugs>	 (03CR) 10Effie Mouzeli: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos)
[10:21:04] <wikibugs>	 10SRE, 10Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez)
[10:21:37] <wikibugs>	 (03Merged) 10jenkins-bot: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos)
[10:23:47] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Swap esams ganeti0[1|2] cluster IPs due to subnet/rack mis-allocation - cmooney@cumin1001"
[10:24:17] <wikibugs>	 (03PS1) 10Vgutierrez: tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330)
[10:24:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Swap esams ganeti0[1|2] cluster IPs due to subnet/rack mis-allocation - cmooney@cumin1001"
[10:24:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:24:56] <wikibugs>	 (03PS1) 10Jbond: confd: only run cleanup command if directory exists [puppet] - 10https://gerrit.wikimedia.org/r/949506
[10:25:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[10:25:16] <wikibugs>	 (03CR) 10Jbond: "this lgtm however i think there is still some race conditions." [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff)
[10:29:58] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez)
[10:31:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:10] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) p:05Triage→03Medium
[10:36:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] Release 0.36-2 for Bookworm (031 comment) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[10:39:27] <fabfur>	 !log restarting haproxy service on all knams cp hosts to silence alerts
[10:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:44] <wikibugs>	 (03CR) 10Muehlenhoff: confd: Explicitly require directory for systemd cleanup timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff)
[10:45:00] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:45:18] <icinga-wm>	 PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:45:19] <effie>	 !log depooling maps on codfw  - T344110
[10:45:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:22] <stashbot>	 T344110: maps2009 is unreachable - https://phabricator.wikimedia.org/T344110
[10:45:45] <wikibugs>	 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10jijiki)
[10:48:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) @hashar could you clarify if T342346 would trigger having python 3.11 on CI with some kind of backport for bullseye or do you have another task tracking python 3.11 suppo...
[10:50:00] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:51:31] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[10:52:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) @NHillard-WMF This request requires your manager's approval, @SCherukuwada if my information is up to date, as well as approval from @odimitrijevic or @...
[10:55:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) a:03NHillard-WMF
[11:06:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Clement_Goubert) 05Open→03Stalled
[11:06:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) 05Open→03Stalled
[11:07:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10Clement_Goubert) 05Open→03Stalled
[11:07:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Clement_Goubert) 05Open→03Stalled
[11:08:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:12:05] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL
[11:12:08] <wikibugs>	 (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324)
[11:13:50] <moritzm>	 !log imported 0.9.0-3~wmf12u1 for bookworm-wikimedia and 0.9.0-3~wmf11u1 for bullseye-wikimedia T340045
[11:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:54] <stashbot>	 T340045: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045
[11:15:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10MoritzMuehlenhoff) I've uploaded dictdiffer for Bullseye and Bookworm (since we're likely about to move the Cumin servers to Bookworm in the not too distant future) to...
[11:15:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10SCherukuwada) Manager here: I approve.
[11:16:30] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert)
[11:19:48] <urbanecm>	 jouncebot: nowandnext
[11:19:48] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 40 minute(s)
[11:19:48] <jouncebot>	 In 1 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300)
[11:20:12] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL
[11:20:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) Hi,  While we add your user to the base group, can you make sure you have:    - Read the [[ https://wikitech.wikimedia.org/wiki/Analytic...
[11:32:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) We would also need clarification on whether this request is also for SSH access or only via superset.
[11:36:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert)
[11:40:11] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "-1 see inline.  this will also need approval from one of the wmde engineering managers[1]" [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[11:41:12] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[11:44:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff)
[11:45:22] <wikibugs>	 (03CR) 10Jbond: "the problem this cr tries to fix is better fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/949496.  As mentioned on that tas" [puppet] - 10https://gerrit.wikimedia.org/r/949506 (owner: 10Jbond)
[11:45:26] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3006.esams.wmnet
[11:45:30] <wikibugs>	 (03Abandoned) 10Jbond: confd: only run cleanup command if directory exists [puppet] - 10https://gerrit.wikimedia.org/r/949506 (owner: 10Jbond)
[11:46:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:48:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] istio: clarify instructions to get the istio version [deployment-charts] - 10https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi)
[11:49:24] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:49:28] <icinga-wm>	 RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 78.10 ms
[11:51:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:52:25] <wikibugs>	 (03PS6) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978)
[11:52:32] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:53:14] <icinga-wm>	 PROBLEM - HTTPS Ganeti RAPI esams on ganeti3006 is CRITICAL: connect to address ganeti02.svc.esams.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[11:54:33] <wikibugs>	 (03PS7) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978)
[11:54:35] <wikibugs>	 (03PS3) 10JMeybohm: Remove limits in ResourceQuota and container limitranges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978)
[11:54:41] <wikibugs>	 (03CR) 10JMeybohm: Remove limits in ResourceQuota and container limitranges for mediawiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm)
[11:55:23] <wikibugs>	 (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034)
[11:55:38] <wikibugs>	 (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034)
[11:55:44] <urbanecm>	 jouncebot: nowandnext
[11:55:44] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 4 minute(s)
[11:55:44] <jouncebot>	 In 1 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300)
[11:56:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[11:56:07] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[12:03:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:08:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[12:16:03] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1100.eqiad.wmnet with OS bullseye
[12:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[12:17:21] <wikibugs>	 (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[12:18:09] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949180|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]], [[gerrit:949181|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]]
[12:18:13] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[12:19:56] <icinga-wm>	 RECOVERY - HTTPS Ganeti RAPI esams on ganeti3006 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[12:20:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:20:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:20:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet
[12:22:38] <icinga-wm>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2023-08-19 04:23:22 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:52] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[12:24:14] <icinga-wm>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2023-10-18 03:52:32 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:19] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:26:30] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949180|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]], [[gerrit:949181|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]] (duration: 08m 20s)
[12:26:33] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[12:26:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:26:42] * urbanecm done
[12:28:02] <icinga-wm>	 PROBLEM - Host ganeti3005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:28:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:29:13] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3007.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:29:49] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3008.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:30:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1100.eqiad.wmnet with reason: host reimage
[12:31:15] <icinga-wm>	 PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:31:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:31:41] <icinga-wm>	 PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100%
[12:32:11] <icinga-wm>	 PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100%
[12:33:54] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1100.eqiad.wmnet with reason: host reimage
[12:34:12] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:34:13] <wikibugs>	 (03PS3) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136)
[12:34:30] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[12:35:16] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[12:36:04] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:36:32] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[12:37:05] <icinga-wm>	 RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.50 ms
[12:37:05] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3007.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:37:09] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Temporarily disable link-recommendation frontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034)
[12:37:11] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3008.mgmt.esams.wmnet with reboot policy GRACEFUL
[12:37:24] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[12:37:32] <urbanecm>	 jouncebot: nowandnext
[12:37:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 22 minute(s)
[12:37:32] <jouncebot>	 In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300)
[12:37:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "to be able to run revalidateLinkRecommendations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[12:37:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm)
[12:37:57] <icinga-wm>	 RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 81.30 ms
[12:39:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:39:43] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:42:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet
[12:44:36] <wikibugs>	 (CR) Jbond: [C: +2] sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/949179 (owner: Jbond)
[12:46:04] <wikibugs>	 (Merged) jenkins-bot: Growth: Temporarily disable link-recommendation frontend [mediawiki-config] - https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: Urbanecm)
[12:46:30] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949510|Growth: Temporarily disable link-recommendation frontend (T344034)]]
[12:46:34] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[12:48:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] istio: clarify instructions to get the istio version [deployment-charts] - https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[12:48:43] <icinga-wm>	 RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 80.57 ms
[12:48:57] <wikibugs>	 SRE, Traffic, observability, Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (RhinosF1) This happened again for lists1001. Requested (and it has been) restart in #wikimedia-sre
[12:49:15] <icinga-wm>	 RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.07 ms
[12:49:42] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/949179 (owner: Jbond)
[12:50:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet
[12:51:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:51:25] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:31] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:52:00] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:53:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:53:40] <wikibugs>	 (PS1) Anzx: Add blkwiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310)
[12:53:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:54:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:54:34] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949510|Growth: Temporarily disable link-recommendation frontend (T344034)]] (duration: 08m 04s)
[12:54:38] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[12:55:05] <godog>	 dcaro: I86bdfba3e broke puppet on a bunch of hosts, including prometheus
[12:55:17] <godog>	 Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while e match for Wmflib::Ensure = Enum['absent', 'present'], got 'file' (file: /etc/puppet/modules/profile/functions/pki/get_c
[12:55:33] <wikibugs>	 (PS4) Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136)
[12:58:02] <urbanecm>	 !log mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=ruwiki --olderThan=1651960800 --verbose # T344034
[12:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:06] <wikibugs>	 (PS1) Anzx: add suwikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314)
[12:59:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:59:30] <wikibugs>	 (CR) Jbond: [C: +2] service::catalog: Add config-master to service catalog [puppet] - https://gerrit.wikimedia.org/r/948560 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[12:59:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[12:59:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:59:47] <godog>	 dcaro: going to revert for now since I'm guessing you are at lunch
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300).
[13:00:05] <jouncebot>	 sergi0, xSavitar, Krinkle, MichaelG_WMDE, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:25] <sergi0>	 hello
[13:00:36] <Krinkle>	 o/
[13:01:12] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - https://gerrit.wikimedia.org/r/949182
[13:01:14] <wikibugs>	 (CR) Gehel: query_service: let puppet manage whitelist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: Bking)
[13:02:24] <urbanecm>	 Krinkle: do you want to start deploying your patch? :)
[13:02:24] * MichaelG_WMDE is here
[13:03:12] <wikibugs>	 (PS1) Jbond: service::catalog: correct discovery value [puppet] - https://gerrit.wikimedia.org/r/949514 (https://phabricator.wikimedia.org/T341717)
[13:03:23] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] service::catalog: correct discovery value [puppet] - https://gerrit.wikimedia.org/r/949514 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[13:03:46] <Krinkle>	 urbanecm: go ahead with yours if you like
[13:04:04] <urbanecm>	 I'll be fully available in a few :)
[13:04:06] <wikibugs>	 (CR) CI reject: [V: -1] Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - https://gerrit.wikimedia.org/r/949182 (owner: Filippo Giunchedi)
[13:05:11] <wikibugs>	 (PS2) Filippo Giunchedi: Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - https://gerrit.wikimedia.org/r/949182
[13:06:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:06:12] <wikibugs>	 (PS2) Jbond: config-master: add new discovery record for config-master [dns] - https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717)
[13:06:42] <wikibugs>	 (CR) Jbond: [C: +2] config-master: add new discovery record for config-master [dns] - https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[13:07:24] <wikibugs>	 (CR) Clément Goubert: [C: +1] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[13:07:36] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - https://gerrit.wikimedia.org/r/949182 (owner: Filippo Giunchedi)
[13:10:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:23] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[13:14:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (MareikeHeuerWMDE)
[13:14:49] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: Sergio Gimeno)
[13:17:40] <wikibugs>	 (CR) Ssingh: [C: +1] sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/949179 (owner: Jbond)
[13:17:45] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: Sergio Gimeno)
[13:18:16] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]]
[13:18:23] <stashbot>	 T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136
[13:19:54] <logmsgbot>	 !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:20:17] <sergi0>	 testing now
[13:20:26] <urbanecm>	 thanks
[13:21:03] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (Gehel) Open→Resolved a:Gehel
[13:21:44] <wikibugs>	 (PS1) Jbond: trafficserver: update config-master to use discovery record [puppet] - https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717)
[13:22:18] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: update config-master to use discovery record [puppet] - https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[13:22:20] <wikibugs>	 (PS1) Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874)
[13:22:47] <sergi0>	 I tested 4 wikis, things looking fine on my end
[13:23:16] <logmsgbot>	 !log urbanecm@deploy1002 sgimeno and urbanecm: Continuing with sync
[13:23:36] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42901/console" [puppet] - https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[13:24:26] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (ayounsi)
[13:25:40] <wikibugs>	 (PS2) Jbond: trafficserver: update config-master to use discovery record [puppet] - https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717)
[13:25:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (Clement_Goubert) Open→In progress Hi,  In order to process your access request, I'm going to need @KFrancis to process your NDA email: mareike.heuer@wikimedia.de as well...
[13:26:01] <_joe_>	 !incidents
[13:26:01] <sirenbot>	 3949 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr2-eqiad.wikimedia.org)
[13:26:14] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (cmooney) ganeti* and cloudvirt* for sure it'd make sense to have this for
[13:26:23] <wikibugs>	 (PS3) Urbanecm: testwikidatawiki: always show MUL in Termbox [mediawiki-config] - https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: Michael Große)
[13:26:43] <wikibugs>	 (CR) Urbanecm: [C: +2] jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: D3r1ck01)
[13:26:50] <wikibugs>	 (CR) Urbanecm: [C: +2] testwikidatawiki: always show MUL in Termbox [mediawiki-config] - https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: Michael Große)
[13:27:19] * MichaelG_WMDE is ready to test whenever you are :)
[13:27:32] <wikibugs>	 (Merged) jenkins-bot: testwikidatawiki: always show MUL in Termbox [mediawiki-config] - https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: Michael Große)
[13:28:03] <urbanecm>	 MichaelG_WMDE: will ping you :)
[13:28:13] <MichaelG_WMDE>	 👍
[13:28:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[13:29:06] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27
[13:29:48] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]] (duration: 11m 32s)
[13:29:52] <stashbot>	 T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136
[13:30:12] <urbanecm>	 sergi0: should be live :)
[13:30:37] <sergi0>	 urbanecm: 🎉 ty!
[13:30:40] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]]
[13:30:41] <urbanecm>	 np
[13:30:46] <stashbot>	 T343409: MUL - Configure Test Wikidata to full-rollout mode - https://phabricator.wikimedia.org/T343409
[13:32:22] <logmsgbot>	 !log urbanecm@deploy1002 migr and urbanecm: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:32:39] <urbanecm>	 MichaelG_WMDE: please test!
[13:33:00] * MichaelG_WMDE tests
[13:33:41] <MichaelG_WMDE>	 urbanecm: It works, thank you!
[13:33:46] <urbanecm>	 great, proceeding
[13:33:50] <logmsgbot>	 !log urbanecm@deploy1002 migr and urbanecm: Continuing with sync
[13:35:27] <dcaro>	 godog: thanks, doctor appointment, weird :/, pcc was clear iirc, I'll recheck when I'm back
[13:36:12] <wikibugs>	 (PS2) Urbanecm: Add blkwiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: Anzx)
[13:36:17] <wikibugs>	 (PS2) Urbanecm: add suwikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[13:36:21] <wikibugs>	 (CR) Urbanecm: [C: +2] Add blkwiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: Anzx)
[13:36:23] <godog>	 dcaro: sure take your time, all is well after the revert, I don't think PCC can catch these failures
[13:36:23] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti02 cluster for esams [puppet] - https://gerrit.wikimedia.org/r/949517
[13:36:25] <wikibugs>	 (CR) Urbanecm: [C: +2] add suwikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[13:36:33] <wikibugs>	 (CR) Nskaggs: [C: +2] Add dr0ptp4kt to wmcs-admin [puppet] - https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: Dr0ptp4kt)
[13:37:19] <wikibugs>	 (Merged) jenkins-bot: Add blkwiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: Anzx)
[13:37:26] <wikibugs>	 (Merged) jenkins-bot: add suwikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[13:38:08] <wikibugs>	 (CR) Ayounsi: [C: +1] Add ganeti02 cluster for esams [puppet] - https://gerrit.wikimedia.org/r/949517 (owner: Muehlenhoff)
[13:40:16] <wikibugs>	 (PS2) Urbanecm: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143)
[13:40:19] <wikibugs>	 (CR) Urbanecm: [C: +2] Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: Urbanecm)
[13:40:23] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]] (duration: 09m 43s)
[13:40:29] <stashbot>	 T343409: MUL - Configure Test Wikidata to full-rollout mode - https://phabricator.wikimedia.org/T343409
[13:41:01] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949512|Add blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]]
[13:41:06] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[13:41:06] <stashbot>	 T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310
[13:41:18] <urbanecm>	 aanzx: your patch's next
[13:41:23] <aanzx>	 Ok
[13:42:38] <wikibugs>	 (PS1) Btullis: Remove the override for rocm version for buster hadoop workers [puppet] - https://gerrit.wikimedia.org/r/949518 (https://phabricator.wikimedia.org/T332570)
[13:42:40] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949512|Add blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:42:50] <aanzx>	 Testing 
[13:43:23] <urbanecm>	 thanks
[13:43:39] <MichaelG_WMDE>	 I now also see my changes on the live servers. Thanks! 🎉
[13:43:40] <wikibugs>	 (Merged) jenkins-bot: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: D3r1ck01)
[13:43:43] <urbanecm>	 MichaelG_WMDE: awesome
[13:43:46] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1101.eqiad.wmnet with OS bullseye
[13:44:07] <wikibugs>	 (CR) Btullis: [C: +2] Remove the override for rocm version for buster hadoop workers [puppet] - https://gerrit.wikimedia.org/r/949518 (https://phabricator.wikimedia.org/T332570) (owner: Btullis)
[13:44:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add ganeti02 cluster for esams [puppet] - https://gerrit.wikimedia.org/r/949517 (owner: Muehlenhoff)
[13:44:19] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (joanna_borun) Approved
[13:44:21] <wikibugs>	 (CR) Ssingh: [C: +2] devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[13:44:23] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:52] <aanzx>	 urbanecm: both logos are good 
[13:44:57] <urbanecm>	 great, proceeding
[13:44:58] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync
[13:46:38] <aanzx>	 urbanecm: there is one patch I have added to the calendar; will it be merged now, or should I reschedule it for later?
[13:46:44] <sukhe>	 !log running homer on asw1-b*27-esams* for CR 949100: T329219
[13:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:48] <stashbot>	 T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219
[13:47:02] <urbanecm>	 aanzx: we're running very short on time, please reschedule for later.
[13:47:19] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - https://gerrit.wikimedia.org/r/949519
[13:47:20] <aanzx>	 Ok thanks, will reschedule it 
[13:48:25] <wikibugs>	 (CR) Urbanecm: [C: +2] Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: Urbanecm)
[13:48:29] <wikibugs>	 (CR) Ayounsi: [C: +1] sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - https://gerrit.wikimedia.org/r/949519 (owner: Muehlenhoff)
[13:49:04] <wikibugs>	 (Merged) jenkins-bot: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: Urbanecm)
[13:50:20] <wikibugs>	 (CR) Ssingh: [C: +2] esams/ntp: point to dns3003 [dns] - https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[13:50:24] <wikibugs>	 (PS2) Ssingh: esams/ntp: point to dns3003 [dns] - https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219)
[13:50:46] <wikibugs>	 (PS1) Jbond: config_master: remove_default_ports and add modules [puppet] - https://gerrit.wikimedia.org/r/949521 (https://phabricator.wikimedia.org/T341717)
[13:51:05] <wikibugs>	 (PS1) Muehlenhoff: Fix profile title [puppet] - https://gerrit.wikimedia.org/r/949522
[13:51:38] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949512|Add blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]] (duration: 10m 37s)
[13:51:43] <wikibugs>	 (CR) Ssingh: [V: +2] esams/ntp: point to dns3003 [dns] - https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[13:51:43] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[13:51:44] <stashbot>	 T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310
[13:51:57] <urbanecm>	 aanzx: should be live.
[13:51:59] <sukhe>	 !log running authdns-update for CR 949113
[13:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:19] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]]
[13:52:20] <urbanecm>	 Krinkle: xSavitar: starting scap for your backport now.
[13:52:25] <Krinkle>	 ack
[13:52:28] <stashbot>	 T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143
[13:52:28] <stashbot>	 T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223
[13:52:28] <stashbot>	 T343291: [betacluster] Cannot login - UserLogin RuntimeException: Failed to run getConfiguration.php - https://phabricator.wikimedia.org/T343291
[13:52:31] <sukhe>	 !log restart pybal on lvs3008
[13:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:37] <wikibugs>	 (CR) Ayounsi: [C: +1] Fix profile title [puppet] - https://gerrit.wikimedia.org/r/949522 (owner: Muehlenhoff)
[13:52:47] <aanzx>	 urbanecm: it is live , thanks 
[13:53:22] <urbanecm>	 np
[13:53:55] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and d3r1ck01: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:54:16] <urbanecm>	 Krinkle: can you test?
[13:56:04] <wikibugs>	 (CR) JHathaway: puppetserver: Add support for defining additional mount points (2 comments) [puppet] - https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: Jbond)
[13:56:11] <Krinkle>	 testing..
[13:56:16] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix profile title [puppet] - https://gerrit.wikimedia.org/r/949522 (owner: Muehlenhoff)
[13:56:39] <wikibugs>	 (PS1) Ssingh: common: update ntp_servers with dns300[34] [homer/public] - https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219)
[13:56:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:14] <wikibugs>	 (CR) CI reject: [V: -1] common: update ntp_servers with dns300[34] [homer/public] - https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[13:57:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - https://gerrit.wikimedia.org/r/949519 (owner: Muehlenhoff)
[13:58:41] <wikibugs>	 (PS2) Ssingh: common: update ntp_servers with dns300[34] [homer/public] - https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219)
[13:59:53] <Krinkle>	 urbanecm: LGTM
[13:59:59] <urbanecm>	 ty, proceeding
[14:00:02] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and d3r1ck01: Continuing with sync
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1400)
[14:00:20] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1101.eqiad.wmnet with reason: host reimage
[14:00:53] <urbanecm>	 actually... Krinkle: i just saw `Object of class MediaWiki\JobQueue\JobQueueGroupFactory could not be converted to string` in logs. sounds like an issue to me?
[14:01:17] <Krinkle>	 urbanecm: that's me messing about on eval.php
[14:01:19] <urbanecm>	 ah
[14:01:45] <urbanecm>	 missed that detail. continuing :)
[14:03:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1101.eqiad.wmnet with reason: host reimage
[14:04:05] <wikibugs>	 (PS1) Jbond: release: add additional instructions [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/949527
[14:04:49] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (Tobi_WMDE_SW) >>! In T344341#9096283, @Clement_Goubert wrote: > Hi, >  > In order to process your access request, I'm going to need @KFrancis to process your NDA (email: mareike...
[14:05:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow3003.esams.wmnet
[14:05:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:05:05] <wikibugs>	 (CR) Jbond: [C: +2] config_master: remove_default_ports and add modules [puppet] - https://gerrit.wikimedia.org/r/949521 (https://phabricator.wikimedia.org/T341717) (owner: Jbond)
[14:06:32] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] (duration: 14m 13s)
[14:06:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:06:39] <stashbot>	 T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143
[14:06:39] <stashbot>	 T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223
[14:06:40] <stashbot>	 T343291: [betacluster] Cannot login - UserLogin RuntimeException: Failed to run getConfiguration.php - https://phabricator.wikimedia.org/T343291
[14:06:46] <urbanecm>	 should all be live.
[14:07:05] <wikibugs>	 (PS1) Urbanecm: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034)
[14:07:14] <wikibugs>	 (PS2) Urbanecm: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034)
[14:07:57] <wikibugs>	 (CR) Urbanecm: [C: +2] Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) (owner: Urbanecm)
[14:08:37] <wikibugs>	 (Merged) jenkins-bot: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) (owner: Urbanecm)
[14:09:09] <wikibugs>	 (PS1) Muehlenhoff: Remove bastion role from bast3006 (will be replaced by bast3007) [puppet] - https://gerrit.wikimedia.org/r/949528
[14:09:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3003.esams.wmnet - jmm@cumin2002"
[14:10:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3003.esams.wmnet - jmm@cumin2002"
[14:10:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:10:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow3003.esams.wmnet on all recursors
[14:10:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow3003.esams.wmnet on all recursors
[14:10:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:10:40] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949184|Revert "Growth: Temporarily disable link-recommendation frontend" (T344034)]]
[14:10:44] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[14:11:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove bastion role from bast3006 (will be replaced by bast3007) [puppet] - 10https://gerrit.wikimedia.org/r/949528 (owner: 10Muehlenhoff)
[14:11:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:11:38] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:23] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:12:27] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:949184|Revert "Growth: Temporarily disable link-recommendation frontend" (T344034)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:12:49] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host netflow3003.esams.wmnet
[14:13:02] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[14:15:30] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[14:16:38] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:54] <wikibugs>	 (03PS1) 10Ssingh: hiera: LVS: update tagged_subnets for esams [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219)
[14:16:57] <wikibugs>	 (03PS1) 10David Caro: p:tlsproxy::envoy: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242)
[14:17:11] <wikibugs>	 (03CR) 10Thcipriani: release: add additional instructions (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 (owner: 10Jbond)
[14:17:45] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42903/console" [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[14:17:49] <dcaro>	 godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949530 should be the new patch, pcc would have caught it but I only tested with the hosts that have 'ensure' set to false, so the default branch did not get tested
[14:18:17] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC failing because of missing facts, let me update them." [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[14:18:50] <sukhe>	 !log sukhe@puppetmaster1001:~$ sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080
[14:18:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:19:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949184|Revert "Growth: Temporarily disable link-recommendation frontend" (T344034)]] (duration: 09m 06s)
[14:19:50] <stashbot>	 T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034
[14:19:54] <urbanecm>	 done all
[14:20:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10Clement_Goubert) 05Open→03In progress
[14:20:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Clement_Goubert) 05Open→03In progress
[14:22:28] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Clement_Goubert) 05Open→03In progress
[14:23:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1100.eqiad.wmnet with OS bullseye
[14:24:10] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+2] Add Nicholas as approver for wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947319 (owner: 10Muehlenhoff)
[14:24:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow3003.esams.wmnet
[14:24:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:24:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:25:04] <sukhe>	 !log ssh pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor
[14:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:48] <urbanecm>	 claime: hi, can i bother you to do `systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited` and `systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered` on mwmaint1002 please? i enabled the jobs on a couple of additional wikis and i'd like to observe how well they cope with the added work. thanks!
[14:25:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:25:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow3003.esams.wmnet on all recursors
[14:25:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow3003.esams.wmnet on all recursors
[14:26:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3003.esams.wmnet - jmm@cumin2002"
[14:26:59] <claime>	 urbanecm: sure thing
[14:27:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3003.esams.wmnet - jmm@cumin2002"
[14:27:13] <urbanecm>	 ty. task is T344143, if you need that info :)
[14:27:13] <stashbot>	 T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143
[14:28:02] <icinga-wm>	 PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow3003.esams.wmnet with OS bookworm
[14:29:18] <claime>	 urbanecm: Launched, I suppose it's normal they're not giving control back, they're supposed to run in the foreground?
[14:29:19] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344353 (10phaultfinder)
[14:30:04] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1101.eqiad.wmnet with OS bullseye
[14:31:01] <urbanecm>	 claime: good question, i'm not privy to systemctl commands, so I'm not sure how they behave. but i see the job's running now.
[14:31:36] <claime>	 yep, it's not a problem if they don't run in the background, I launched them in a tmux
[14:31:58] <claime>	 are they supposed to be run on a timer usually or something?
[14:33:26] <claime>	 Yeah, according to puppet they're periodic jobs
[14:33:50] <urbanecm>	 yeah, it's a timer-based job. i wanted them to run now, so i can better monitor for possible issues, given i enabled them for our biggest wikis today.
[14:34:09] <claime>	 urbanecm: ack. mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered just finished running
[14:34:18] <urbanecm>	 great, thanks. 
[14:34:33] <claime>	 And so did mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited
[14:35:00] <urbanecm>	 thanks again. logs look good so far. i'll monitor logstash for a bit.
[14:35:00] <wikibugs>	 (03PS3) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400)
[14:36:47] <wikibugs>	 (03PS3) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400)
[14:38:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:39:52] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[14:41:00] <icinga-wm>	 RECOVERY - config-master.wikimedia.org requires authentication on config-master1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:41:14] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[14:42:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:49:58] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[14:50:00] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:51:06] <wikibugs>	 (03PS1) 10Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219)
[14:51:13] <wikibugs>	 10SRE, 10Observability-Alerting: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) While investigating the ns2-v4 not being removed (cc @andrea.denisse @ssingh ) today, this is the log:  ` Aug 11 16:28:31 alert1001 puppet-agent[4665]: Applying confi...
[14:51:41] <jelto>	 !log registry* - upgrade jwt-authorizer package on all 4 hosts to version 1.1.1-1 - T337474
[14:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:49] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ayounsi)
[14:51:54] <stashbot>	 T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474
[14:52:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow3003.esams.wmnet with reason: host reimage
[14:54:13] <wikibugs>	 (03PS1) 10Ayounsi: Update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219)
[14:56:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow3003.esams.wmnet with reason: host reimage
[14:57:59] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@155299c] (releasing): (no justification provided)
[14:58:29] <wikibugs>	 (03PS1) 10Ayounsi: Update esams netflow collector [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219)
[14:58:40] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@155299c] (releasing): (no justification provided) (duration: 00m 41s)
[15:00:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3006.wikimedia.org
[15:00:24] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1102.eqiad.wmnet with OS bullseye
[15:00:41] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1103.eqiad.wmnet with OS bullseye
[15:00:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:01:39] <wikibugs>	 (03PS1) 10Jbond: admin: add taavi to ops group [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307)
[15:01:53] <wikibugs>	 10SRE, 10Math, 10RESTbase Sunsetting, 10Traffic: Determin the cause of a sudden 80% drop in requests to math endpoints - https://phabricator.wikimedia.org/T344329 (10daniel)
[15:02:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Add netflow3003 to Ferm rules for Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/949537 (https://phabricator.wikimedia.org/T344355)
[15:02:57] <icinga-wm>	 RECOVERY - config-master.wikimedia.org requires authentication on config-master2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:03:06] <wikibugs>	 10SRE, 10Math, 10RESTbase Sunsetting, 10Traffic: Determin the cause of x8 increase in requests to math endpoints between july 6 and August 3 - https://phabricator.wikimedia.org/T344329 (10daniel)
[15:03:11] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos)
[15:04:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:04:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[15:04:36] <wikibugs>	 (03CR) 10Jbond: "ready" [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: 10Jbond)
[15:04:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[15:05:30] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Add netflow3003 to Ferm rules for Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/949537 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[15:05:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[15:08:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:08:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:09:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:09:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:09:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3006.wikimedia.org
[15:10:07] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3006.wikimedia.org` - bast3006.wikimedia.org (**PASS**)   - Downt...
[15:10:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping3003.esams.wmnet
[15:13:19] <wikibugs>	 (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324)
[15:13:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355)
[15:14:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow3003.esams.wmnet with OS bookworm
[15:14:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow3003.esams.wmnet
[15:14:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:14:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1103.eqiad.wmnet with reason: host reimage
[15:14:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1102.eqiad.wmnet with reason: host reimage
[15:14:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[15:15:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:17:27] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1103.eqiad.wmnet with reason: host reimage
[15:18:04] <wikibugs>	 (03PS10) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056)
[15:18:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:18:35] <wikibugs>	 (03CR) 10Ayounsi: "1 comment but lgtm otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[15:18:46] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[15:18:54] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[15:18:59] <wikibugs>	 (03CR) 10Jbond: "ready" [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond)
[15:19:04] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[15:19:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[15:19:15] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] hiera: LVS: update tagged_subnets for esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[15:19:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[15:19:39] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Update esams netflow collector [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi)
[15:19:56] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1102.eqiad.wmnet with reason: host reimage
[15:20:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] hiera: LVS: update tagged_subnets for esams [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[15:20:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli)
[15:20:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:20:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:20:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping3003.esams.wmnet
[15:20:41] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping3003.esams.wmnet` - ping3003.esams.wmnet (**PASS**)   - Downtimed...
[15:21:11] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff)
[15:21:31] <wikibugs>	 (03CR) 10Jbond: release: add additional instructions (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 (owner: 10Jbond)
[15:21:33] <wikibugs>	 (03PS1) 10Fabfur: hiera: decommission dns3001 and dns3002 [puppet] - 10https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219)
[15:21:56] <wikibugs>	 (03PS2) 10Jbond: release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527
[15:23:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:24:24] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330) (owner: 10Vgutierrez)
[15:25:27] <wikibugs>	 (03PS1) 10Muehlenhoff: New install server for new esams [puppet] - 10https://gerrit.wikimedia.org/r/949543 (https://phabricator.wikimedia.org/T344355)
[15:27:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] New install server for new esams [puppet] - 10https://gerrit.wikimedia.org/r/949543 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff)
[15:28:21] <wikibugs>	 10SRE, 10Traffic: Q1:unified decommission task for old esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (10ssingh)
[15:28:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install3003.wikimedia.org
[15:28:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:29:16] <wikibugs>	 (03PS1) 10BCornwall: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154)
[15:29:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:29:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10jbond) >>! In T343039#9068357, @Marostegui wrote: > We really need to come up with a way to be able to grant root access to clouddb* hosts that doesn't imply root on al...
[15:30:13] <wikibugs>	 (03CR) 10BCornwall: Release 0.36-2 for Bookworm (032 comments) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[15:30:40] <wikibugs>	 (03PS2) 10Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219)
[15:30:58] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:09] <wikibugs>	 (03PS2) 10Jbond: P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056)
[15:33:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3003.wikimedia.org - jmm@cumin2002"
[15:33:16] <wikibugs>	 (03PS2) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056)
[15:33:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3003.wikimedia.org - jmm@cumin2002"
[15:33:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:33:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install3003.wikimedia.org on all recursors
[15:34:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[15:34:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install3003.wikimedia.org on all recursors
[15:34:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3003.wikimedia.org - jmm@cumin2002"
[15:35:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3003.wikimedia.org - jmm@cumin2002"
[15:35:34] <wikibugs>	 (CR) Jbond: P:puppetserver: add support for extra_mounts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: Jbond)
[15:36:02] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good!" [puppet] - https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: Jbond)
[15:36:17] <wikibugs>	 (CR) Ayounsi: P:pybal: update bgp-peer-address for asw1-b*27-esams (2 comments) [puppet] - https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:37:18] <wikibugs>	 (PS1) Esanders: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587)
[15:38:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (nskaggs) > @nskaggs As the group owner are you able to approve this request  Yes, I approve.
[15:38:16] <wikibugs>	 (PS3) Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219)
[15:38:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install3003.wikimedia.org with OS bullseye
[15:39:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts netflow3002.esams.wmnet
[15:39:17] <wikibugs>	 (CR) Ayounsi: [C: +1] P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:39:58] <sukhe>	 !log homer "mr*" commit "add ntp_servers add dns300[34]"
[15:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:40:49] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] hiera: LVS: update tagged_subnets for esams [puppet] - https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:41:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1103.eqiad.wmnet with OS bullseye
[15:42:08] <wikibugs>	 (CR) Ssingh: [C: +2] P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[15:43:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:43:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1102.eqiad.wmnet with OS bullseye
[15:44:36] <jnuche>	 jouncebot: nowandnext
[15:44:36] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 15 minute(s)
[15:44:36] <jouncebot>	 In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1700)
[15:45:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:45:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:45:51] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42904/console" [puppet] - https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: David Caro)
[15:46:38] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:46:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[15:46:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:46:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow3002.esams.wmnet
[15:46:59] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `netflow3002.esams.wmnet` - netflow3002.esams.wmnet (**PASS**)   - Dow...
[15:47:17] <wikibugs>	 (CR) David Caro: [V: +1] "Tested now with one of enabling and one disabling envoy." [puppet] - https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: David Caro)
[15:47:18] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff)
[15:48:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:28] <sukhe>	 !log restart pybal on new lvses in esams
[15:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:30] <wikibugs>	 (PS1) BryanDavis: shellbox: Bump to 2023-08-15-040901 [deployment-charts] - https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460)
[15:51:20] <bd808>	 legoktm: Do you have any practical advice for how to test shellbox containers in staging? Asking in reference to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/949548/
[15:52:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install3003.wikimedia.org with reason: host reimage
[15:52:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:32] <wikibugs>	 (CR) Ssingh: [C: +1] hiera: decommission dns3001 and dns3002 [puppet] - https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219) (owner: Fabfur)
[15:56:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install3003.wikimedia.org with reason: host reimage
[15:56:42] <wikibugs>	 (PS1) BCornwall: Remove esams hosts prior to knams migration [puppet] - https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T329219)
[15:57:51] <wikibugs>	 (PS1) Muehlenhoff: Make install3003 the new install server for esams [puppet] - https://gerrit.wikimedia.org/r/949552 (https://phabricator.wikimedia.org/T344355)
[15:58:30] <wikibugs>	 (CR) Fabfur: [C: +2] hiera: decommission dns3001 and dns3002 [puppet] - https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219) (owner: Fabfur)
[16:00:23] <fabfur>	 !log running puppet-agent on A:cumin A:dns-rec A:netbox to remove dns3001 and dns3002
[16:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:19] <wikibugs>	 (PS11) Jbond: puppetserver: Add support for defining additional mount points [puppet] - https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056)
[16:02:21] <wikibugs>	 (PS3) Jbond: P:puppetserver: add support for extra_mounts [puppet] - https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056)
[16:02:23] <wikibugs>	 (PS3) Jbond: puppetserver: add volatile file mount [puppet] - https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056)
[16:04:14] <wikibugs>	 (CR) Majavah: "<3" [puppet] - https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: Jbond)
[16:05:18] <wikibugs>	 (PS2) BCornwall: Remove esams hosts prior to knams migration [puppet] - https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363)
[16:05:20] <wikibugs>	 (CR) Jbond: [C: +2] admin: add taavi to ops group [puppet] - https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: Jbond)
[16:05:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] p:tlsproxy::envoy: pass through the ensure option [puppet] - https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: David Caro)
[16:05:58] <jbond>	 taavi FYI ^^^ is merged, let me know if you want me to run puppet anywhere specific
[16:06:15] <urbanecm>	 congratulations taavi! :)
[16:06:30] <taavi>	 jbond: thank you!! I think I'm fine waiting for puppet to run naturally
[16:06:37] <jbond>	 ack sgtm
[16:06:40] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff)
[16:06:53] <taavi>	 can I add myself to ldap/ops or do I need to request that separately?
[16:07:07] <jbond>	 taavi I'll do that as well, one sec
[16:07:46] * jbond sees many taavi* users in ldap :)
[16:08:10] <taavi>	 that might happen if you're trying to debug the authentication system :P
[16:08:14] <urbanecm>	 :D
[16:08:27] * urbanecm looks at the `MU test *` accounts in SUL
[16:08:34] <jbond>	 :) ok that's done now as well, welcome and congrats :)
[16:11:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install3003.wikimedia.org with OS bullseye
[16:11:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install3003.wikimedia.org
[16:12:15] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RobH)
[16:12:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh[3001-3002].wikimedia.org
[16:16:39] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:17:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[16:19:14] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts dns3001.wikimedia.org
[16:21:10] <jbond>	 !log mv /var/lib/puppet/volatile/misc /home/jbond on puppetmaster1001 as it (legacy geoip data) appears unused
[16:21:10] <wikibugs>	 SRE, SRE-Access-Requests: Login rejected on horizon.wikimedia.org - https://phabricator.wikimedia.org/T344367 (darthmon_wmde)
[16:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh[3001-3002].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:21:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:22:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh[3001-3002].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:22:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:22:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh[3001-3002].wikimedia.org
[16:22:37] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `doh[3001-3002].wikimedia.org` - doh3001.wikimedia.org (**PASS**)...
[16:23:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum[3001-3002].esams.wmnet
[16:23:58] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1104.eqiad.wmnet with OS bullseye
[16:24:02] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[16:24:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1105.eqiad.wmnet with OS bullseye
[16:25:07] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:25:08] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dns3001.wikimedia.org
[16:25:19] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `dns3001.wikimedia.org` - dns3001.wikimedia.org (**PASS**)   - Downti...
[16:28:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[16:28:20] <wikibugs>	 (CR) Majavah: [C: +2] tools-static: Hide more Cloudflare response headers [puppet] - https://gerrit.wikimedia.org/r/940506 (owner: Lucas Werkmeister)
[16:30:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum[3001-3002].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:31:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum[3001-3002].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:31:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:31:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum[3001-3002].esams.wmnet
[16:31:32] <wikibugs>	 (PS1) Jbond: puppetmaster: stop creating the volatile/misc folder [puppet] - https://gerrit.wikimedia.org/r/949554 (https://phabricator.wikimedia.org/T341717)
[16:31:35] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `durum[3001-3002].esams.wmnet` - durum3001.esams.wmnet (**PASS**)...
[16:31:38] <jnuche>	 !log restarting CI Jenkins to update plugins
[16:31:38] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:47] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts dns3002.wikimedia.org
[16:33:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:41] <jbond>	 !log mv /var/lib/puppet/volatile/squid /home/jbond on puppetmaster1001 as it appears unused
[16:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:00] <wikibugs>	 (CR) Ssingh: [C: +1] Remove esams hosts prior to knams migration [puppet] - https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363) (owner: BCornwall)
[16:36:40] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[16:37:10] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:15] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ssingh)
[16:38:01] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1104.eqiad.wmnet with reason: host reimage
[16:38:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1105.eqiad.wmnet with reason: host reimage
[16:39:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir[3001-3002].esams.wmnet
[16:39:24] <wikibugs>	 SRE, PyBal, Scap, Traffic, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (thcipriani)
[16:40:08] <wikibugs>	 (PS4) Jbond: puppetserver: add volatile file mount [puppet] - https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056)
[16:40:23] <wikibugs>	 (CR) Jbond: puppetserver: add volatile file mount (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: Jbond)
[16:40:23] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ncredir[3001-3002].esams.wmnet
[16:40:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[16:41:21] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[16:41:21] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:41:22] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns3002.wikimedia.org
[16:41:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1104.eqiad.wmnet with reason: host reimage
[16:41:38] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:42:05] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `dns3002.wikimedia.org` - dns3002.wikimedia.org (**PASS**)   - Downti...
[16:42:51] <wikibugs>	 (CR) BCornwall: [C: +2] Remove esams hosts prior to knams migration [puppet] - https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363) (owner: BCornwall)
[16:43:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:43:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1105.eqiad.wmnet with reason: host reimage
[16:43:48] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@ec5d4cd]: T342213
[16:43:51] <stashbot>	 T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213
[16:44:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:44:46] <wikibugs>	 (PS1) Eevans: restbase: move (temporary) per-host settings back to role [puppet] - https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298)
[16:44:47] <logmsgbot>	 !log btullis@deploy1002 deploy aborted: T342213 (duration: 00m 59s)
[16:45:39] <wikibugs>	 (PS2) Eevans: restbase: move (temporary) per-host settings back to role [puppet] - https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298)
[16:45:57] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213
[16:46:07] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:21] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q1:unified decommission task for old esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (Fabfur)
[16:46:38] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:46:49] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[16:47:07] <jinxer-wm>	 (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3050-3053].esams.wmnet
[16:47:21] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs3009 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:47:39] <jynus>	 esams and other related hosts complaining
[16:47:45] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213 (duration: 01m 48s)
[16:49:03] <sukhe>	 jynus: yeah decommissioning in progress
[16:49:07] <sukhe>	 probably should downtime
[16:49:18] <wikibugs>	 (CR) Eevans: [C: +2] restbase: move (temporary) per-host settings back to role [puppet] - https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[16:49:46] <jinxer-wm>	 (ConfdResourceFailed) firing: (24) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:50:21] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:50:39] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:50:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:51:13] <sukhe>	 silencing
[16:51:48] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213
[16:51:51] <stashbot>	 T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213
[16:52:01] <sukhe>	 !log restarting ntp service in core sites
[16:52:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:57] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:52:59] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213
[16:54:18] <wikibugs>	 (PS1) Ssingh: ncredir300x: decommission hosts in esams [puppet] - https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355)
[16:54:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:57:00] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 04m 00s)
[16:57:03] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213
[16:57:07] <jinxer-wm>	 (ProbeDown) firing: (14) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:57:08] <stashbot>	 T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213
[16:57:11] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:33] <wikibugs>	 (PS2) Ssingh: ncredir300x: decommission hosts in esams [puppet] - https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1700)
[17:00:13] <wikibugs>	 (PS3) Ssingh: ncredir300x: decommission hosts in esams [puppet] - https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355)
[17:01:52] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[17:02:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[17:04:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1104.eqiad.wmnet with OS bullseye
[17:06:08] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 09m 04s)
[17:06:11] <stashbot>	 T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213
[17:06:23] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3050-3053].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:06:52] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[17:06:55] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1105.eqiad.wmnet with OS bullseye
[17:07:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3050-3053].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:07:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:07:27] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3050-3053].esams.wmnet
[17:07:39] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3050-3053].esams.wmnet` - cp3050.esams.wmnet (**PASS**)   - Downti...
[17:09:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:04] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3054-3057].esams.wmnet
[17:14:20] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:14:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:19:24] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:19:44] <sukhe>	 ^ these will be resolving soon, space restarts are in progress
[17:20:00] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[17:20:07] <wikibugs>	 (CR) Ssingh: "Not sure if needed but I thought I should check with you before decomm." [puppet] - https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[17:22:13] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3054-3057].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:23:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3054-3057].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:23:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:31] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3054-3057].esams.wmnet
[17:23:42] <wikibugs>	 SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3054-3057].esams.wmnet` - cp3054.esams.wmnet (**PASS**)   - Downti...
[17:26:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3058-3061].esams.wmnet
[17:27:26] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:50] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:31:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10KFrancis) Thank you.  The NDA is out for signatures.
[17:33:49] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:36:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:38:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[17:40:17] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3058-3061].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:40:53] <wikibugs>	 (03PS1) 10Ssingh: lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363)
[17:41:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:42:07] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42905/console" [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh)
[17:43:10] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3058-3061].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:43:10] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:43:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3058-3061].esams.wmnet
[17:43:20] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Q1:unified decommission task for old esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3058-3061].esams.wmnet` - cp3058.esams.wmnet (**PASS**)   - D...
[17:45:26] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3062-3065].esams.wmnet
[17:45:31] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:45:34] <sukhe>	 ^ expected
[17:46:21] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[17:50:20] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh)
[17:50:27] <fabfur>	 !log run puppet-agent on A:dns-rec to restart ntp service
[17:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:20] <fabfur>	 !log restart ntp on A:dns-rec and A:edges
[17:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:39] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh)
[17:54:33] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[17:55:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:56:29] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3062-3065].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:56:38] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:57:22] <sukhe>	 expected, decom
[17:58:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs[3005-3007].esams.wmnet
[17:58:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3062-3065].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002"
[17:58:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:58:53] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3062-3065].esams.wmnet
[18:00:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:05] <jouncebot>	 brennen and dancy: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1800). Please do the needful.
[18:00:05] <jouncebot>	 brennen and dancy: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1800).
[18:00:11] <dancy>	 o/
[18:00:24] <dancy>	 Train is unblocked.  Pressing the buttons.
[18:00:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:01:15] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724)
[18:01:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot)
[18:02:06] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot)
[18:03:06] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:06:24] <sukhe>	 uh oh
[18:06:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[18:06:38] <sukhe>	 probably the LVS removal
[18:06:58] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me. Let's add the secret and test tomorrow." [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[18:06:59] <sukhe>	 but esams is depooled
[18:07:01] <sukhe>	 can someone ACK it?
[18:07:06] <sukhe>	 yeah it was that
[18:07:07] <jinxer-wm>	 (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:07:47] <sukhe>	 arnoldokoth: thanks for ACK
[18:09:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs[3005-3007].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:09:34] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] datahub: Enable OIDC to idp_test (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene)
[18:10:14] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.22  refs T343724
[18:10:18] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[18:10:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs[3005-3007].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:10:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:10:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs[3005-3007].esams.wmnet
[18:10:38] <dancy>	 I'm going to let the train marinate on group0 for an hour.
[18:10:43] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs[3005-3007].esams.wmnet` - lvs3005.esams.wmnet (**PASS**)   - Do...
[18:11:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:12:16] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:15:49] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10bking)
[18:16:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:16:49] <wikibugs>	 (03PS1) 10Eevans: restbase: set legacy ssl port & optional encryption to false [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298)
[18:18:14] <arnoldokoth>	 sukhe: np.
[18:19:40] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:19:42] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:21:33] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans)
[18:26:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:12] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158)
[18:30:12] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:50] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:42:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:45:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:44] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:46:34] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:48:10] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:50:00] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:52:06] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:52:34] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:59:06] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961)
[18:59:08] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618)
[19:00:37] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961)
[19:01:40] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[19:01:58] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders)
[19:02:17] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[19:03:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10andrea.denisse) 05In progress→03Resolved Hi, I sent a patch for this change that was awaiting review.  https://gerrit.wikimedia.org/r/c/operations/puppet/+/940269/  Closing...
[19:04:11] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 04-1] "I want to do this next week" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: 10Bartosz Dziewoński)
[19:07:18] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 04-1] "Will deploy together with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947015" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) (owner: 10Bartosz Dziewoński)
[19:11:11] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10thcipriani) Approved from the `deployment` group. Rationale makes sense.
[19:11:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10thcipriani)
[19:11:58] <wikibugs>	 (03Abandoned) 10Andrea Denisse: groups: Add taavi to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/940269 (https://phabricator.wikimedia.org/T342307) (owner: 10Andrea Denisse)
[19:14:34] <wikibugs>	 (03PS2) 10Gehel: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503
[19:25:08] <wikibugs>	 (03PS3) 10Gehel: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361)
[19:25:35] <wikibugs>	 (03CR) 10Ahmon Dancy: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel)
[19:30:51] <dancy>	 Rolling the train to group1
[19:31:03] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 (https://phabricator.wikimedia.org/T343724)
[19:31:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot)
[19:31:36] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy)
[19:31:47] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot)
[19:36:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:40:46] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.22  refs T343724
[19:40:50] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[19:41:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:48:00] <logmsgbot>	 !log dancy@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.22  refs T343724 (duration: 07m 14s)
[19:48:04] <stashbot>	 T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724
[19:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:53:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:58:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T2000)
[20:00:05] <jouncebot>	 MatmaRex and aanzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <urbanecm>	 i can deploy today
[20:00:44] <MatmaRex>	 hi
[20:00:48] <urbanecm>	 hey!
[20:01:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) (owner: 10Bartosz Dziewoński)
[20:01:44] <urbanecm>	 i'm fond of experiments, so...let's see :)
[20:01:56] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders)
[20:02:19] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) (owner: 10Bartosz Dziewoński)
[20:02:35] <wikibugs>	 (03Merged) 10jenkins-bot: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders)
[20:02:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[20:02:52] <wikibugs>	 (03PS5) 10Urbanecm: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[20:02:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[20:04:02] <wikibugs>	 (03Merged) 10jenkins-bot: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders)
[20:04:36] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]]
[20:04:44] <stashbot>	 T312587: Show edit notices within mobile editing interfaces - https://phabricator.wikimedia.org/T312587
[20:04:44] <stashbot>	 T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961
[20:04:46] <urbanecm>	 aanzx: are you around too?
[20:06:16] <logmsgbot>	 !log urbanecm@deploy1002 esanders and urbanecm and matmarex: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:06:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:07:05] <urbanecm>	 MatmaRex: all three pulled to mwdebug, but afaics, they're not testable (wikitech's not XWD-enabled and rest are no-ops). is that right?
[20:07:29] <MatmaRex>	 yeah, i just noticed that the mwdebug stuff doesn't work on wikitech :/
[20:07:34] <MatmaRex>	 i guess we're testing this one in production
[20:07:38] <MatmaRex>	 the rest are indeed no-ops
[20:07:43] <urbanecm>	 yeah, i have to sync that out and we'll see
[20:07:44] <logmsgbot>	 !log urbanecm@deploy1002 esanders and urbanecm and matmarex: Continuing with sync
[20:07:46] <urbanecm>	 proceeding
[20:11:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:14:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:04] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]] (duration: 09m 27s)
[20:14:09] <stashbot>	 T312587: Show edit notices within mobile editing interfaces - https://phabricator.wikimedia.org/T312587
[20:14:09] <stashbot>	 T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961
[20:14:20] <urbanecm>	 MatmaRex: deployed to prod. can you test the wikitech stuff please? :)
[20:15:16] <MatmaRex>	 visual editor seems to work: https://wikitech.wikimedia.org/w/index.php?title=Sandbox&diff=prev&oldid=2100385
[20:15:32] <MatmaRex>	 thanks for deploying
[20:15:56] <urbanecm>	 great!
[20:16:10] <urbanecm>	 aanzx: you around?
[20:16:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:17:21] <aanzx>	 urbanecm: yes
[20:17:47] <urbanecm>	 ok, let's deploy. 
[20:18:00] <wikibugs>	 (03PS5) 10Urbanecm: Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[20:18:02] <aanzx>	 Ok
[20:18:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[20:19:51] <wikibugs>	 (03Merged) 10jenkins-bot: Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx)
[20:20:30] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]]
[20:20:34] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[20:22:46] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:22:55] <urbanecm>	 aanzx: please test
[20:23:01] <aanzx>	 Testing 
[20:23:04] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10thcipriani) @Mabualruz I can't remember have you done our https://wikitech.wikimedia.org/wiki/Deployments/Training before?  I can't seem to find a task...
[20:25:19] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:40] <aanzx>	 urbanecm: tested looks good 
[20:25:45] <urbanecm>	 thanks, syncing
[20:25:46] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync
[20:32:23] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:32:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2002.codfw.wmnet
[20:32:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:32:41] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]] (duration: 12m 11s)
[20:32:44] <stashbot>	 T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314
[20:32:55] <urbanecm>	 aanzx: live
[20:33:08] <aanzx>	 urbanecm: ok thanks 
[20:34:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001"
[20:35:23] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001"
[20:35:23] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2002.codfw.wmnet on all recursors
[20:35:26] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2002.codfw.wmnet on all recursors
[20:35:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001"
[20:36:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001"
[20:37:23] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:37:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2002.codfw.wmnet with OS bookworm
[20:51:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:55:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:56:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T2100)
[21:40:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:30] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:33] <ryankemper>	 !log T343124 [WDQS] Pooled `wdqs1012` and `wdqs1013` (passing checks after reimage/data transfer)
[21:49:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:37] <stashbot>	 T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124
[21:52:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2002.codfw.wmnet with OS bookworm
[21:52:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2002.codfw.wmnet
[22:01:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:44] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:06:08] <jinxer-wm>	 (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:07:07] <jinxer-wm>	 (ProbeDown) firing: (13) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:13:34] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:13:53] <rzl>	 hm, that's just esams but I thought it was silenced already
[22:14:32] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:14:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:15:09] <wikibugs>	 (PS4) Bking: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[22:16:39] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:16:48] <rzl>	 oh I see, it looks like s.ukhe's silence covered module=http_ncredir-https_ip[46] but only family=ip4, adding ip6 now
[22:18:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:18:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:19:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:20:14] <rzl>	 done, and matched both values for address too
[22:27:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:50:00] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:11:08] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:36] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state