[00:04:17] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:19:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:20:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:25:13] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:29:53] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:59] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:06] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948687 (owner: 10TrainBranchBot) [00:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948694 [00:38:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948694 (owner: 10TrainBranchBot) [00:57:05] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/948694 (owner: 10TrainBranchBot) [00:57:29] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:42] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [01:17:31] (03CR) 10Anzx: add 'autopatrol' to Wikifunctions' functioneer group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [01:21:22] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host [01:21:33] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host (duration: 00m 10s) [01:22:03] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploy to freshly 
reimaged host [01:22:14] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploy to freshly reimaged host (duration: 00m 10s) [01:26:33] (03PS1) 10Ryan Kemper: wdqs.data-transfer: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/949146 [01:29:33] jouncebot: nowandnext [01:29:33] No deployments scheduled for the next 4 hour(s) and 30 minute(s) [01:29:33] In 4 hour(s) and 30 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600) [01:30:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:30:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:31:43] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [01:35:03] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [01:36:23] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:36:46] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:37:14] (03PS1) 10Majavah: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031) [01:37:52] (03PS1) 10Majavah: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031) [01:39:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [01:39:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [01:41:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [01:41:49] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:45:14] (03Merged) 10jenkins-bot: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949166 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [01:45:17] (03Merged) 10jenkins-bot: OAuthUserRepository: Ensure we don't end up with duplicate rows [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949167 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [01:46:01] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]] [01:46:05] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [01:47:39] !log taavi@deploy1002 taavi: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [01:50:29] !log taavi@deploy1002 taavi: Continuing with sync [01:51:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:54:16] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:57:01] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949166|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]], [[gerrit:949167|OAuthUserRepository: Ensure we don't end up with duplicate rows (T242031)]] (duration: 10m 59s) [01:57:01] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:04] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [02:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:11:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:16:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:20:29] !log create oathauth_devices and oathauth_types tables on wikitech, private.dblist, fishbowl.dblist, centralauth T242031 [02:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:33] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [02:31:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:43] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:33] (03PS1) 10Majavah: Add READ_NEW | WRITE_NEW for OAuth multiple devices to techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) [02:49:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:53:47] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:51] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:53] (03PS2) 10Mdaniels5757: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) [03:14:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:13] (03PS2) 10Majavah: Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) [03:18:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:01] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:22:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:22:45] (03PS1) 10Majavah: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031) [03:23:05] (03PS1) 10Majavah: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031) [03:23:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:23:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:24:00] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [03:26:01] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: 
CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:27:11] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:27:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:28:54] (03Merged) 10jenkins-bot: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949169 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:28:56] (03Merged) 10jenkins-bot: Keep both tables up-to-date on WRITE_BOTH [extensions/OATHAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949168 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:29:21] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:27] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]] [03:29:32] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [03:30:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:31:05] !log taavi@deploy1002 taavi: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [03:33:55] !log taavi@deploy1002 taavi: Continuing with sync [03:34:03] 
RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:40:26] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949169|Keep both tables up-to-date on WRITE_BOTH (T242031)]], [[gerrit:949168|Keep both tables up-to-date on WRITE_BOTH (T242031)]] (duration: 10m 58s) [03:40:30] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [03:43:09] (03CR) 10Ladsgroup: [C: 03+2] Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:43:49] (03Merged) 10jenkins-bot: Set WRITE_BOTH for OAuth multiple devices to techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949161 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [03:44:25] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]] [03:45:58] !log taavi@deploy1002 taavi: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [03:46:02] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [03:47:10] !log taavi@deploy1002 taavi: Continuing with sync [03:48:41] (03PS1) 10Zabe: Initial configuration for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539) [03:51:05] 
RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:53:32] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949161|Set WRITE_BOTH for OAuth multiple devices to techconductwiki (T242031)]] (duration: 09m 07s) [03:53:35] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [03:55:28] (03CR) 10Zabe: [C: 03+2] Initial configuration for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539) (owner: 10Zabe) [03:55:41] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:49] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:08] (03Merged) 10jenkins-bot: Initial configuration for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949189 (https://phabricator.wikimedia.org/T343539) (owner: 10Zabe) [03:56:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:58:11] !log create Wikisource Sundanese # T343539 [03:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:14] T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539 [03:58:51] !log zabe@deploy1002 Started 
scap: T343539 [04:00:41] !log zabe@deploy1002 zabe: T343539 synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [04:01:14] !log zabe@deploy1002 zabe: Continuing with sync [04:07:40] !log zabe@deploy1002 Finished scap: T343539 (duration: 08m 49s) [04:07:46] T343539: Create Wikisource Sundanese - https://phabricator.wikimedia.org/T343539 [04:10:44] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948705 [04:10:46] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948705 (owner: 10Zabe) [04:11:01] !log zabe@deploy1002 Started scap: update interwiki cache [04:11:23] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948705 (owner: 10Zabe) [04:19:18] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 08m 17s) [04:21:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:45] (03PS1) 10Zabe: Initial configuration for Wiktionary Pa'O [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540) [04:24:24] (03CR) 10Zabe: [C: 03+2] Initial configuration for Wiktionary Pa'O [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540) (owner: 10Zabe) [04:25:04] (03Merged) 10jenkins-bot: Initial configuration for Wiktionary Pa'O [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949191 (https://phabricator.wikimedia.org/T343540) (owner: 10Zabe) [04:25:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:19] !log zabe@deploy1002 Started scap: T343540 [04:28:23] T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540 [04:29:17] !log create Wiktionary Pa'O # T343540 [04:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:54] !log zabe@deploy1002 zabe: T343540 synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [04:30:15] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:12] !log zabe@deploy1002 zabe: Continuing with sync [04:31:31] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:37:34] !log zabe@deploy1002 Finished scap: T343540 (duration: 09m 15s) [04:37:38] T343540: Create Wiktionary Pa'O - https://phabricator.wikimedia.org/T343540 [04:38:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [04:39:25] PROBLEM - Check systemd state on mw2330 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:54] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949207 [04:39:56] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949207 (owner: 10Zabe) [04:40:36] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949207 (owner: 10Zabe) [04:41:20] !log zabe@deploy1002 Started scap: update interwiki cache [04:44:58] (03CR) 10Vipz: [C: 03+1] Add "editautopatrolprotected", "patrol", "rollback" protection levels on sh.wiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/949171 (owner: 10Acamicamacaraca) [04:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:49:26] (03CR) 10Jforrester: "This doesn't need to go through Engineering but Product. It doesn't seem on-wiki that this has consensus yet, or a model of what rights wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [04:49:33] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 08m 12s) [04:50:17] (03PS3) 10Acamicamacaraca: Add "editautopatrolprotected", "patrol", "rollback" protection levels on sh.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 [04:51:41] (03PS4) 10Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) [04:53:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:55:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:56:21] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:43] PROBLEM - 
Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 210, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:12:09] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:56] (03PS5) 10Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) [05:19:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:39:08] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:41:13] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:42:45] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:44:08] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:56:41] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:59:17] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600) [06:01:41] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:05:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3006.esams.wmnet with OS bullseye [06:07:17] (03PS1) 10Marostegui: install_server: Do not reimage db2190 [puppet] - 10https://gerrit.wikimedia.org/r/949400 [06:08:38] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2190 [puppet] - 
10https://gerrit.wikimedia.org/r/949400 (owner: 10Marostegui) [06:15:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:57] (03PS2) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) [06:24:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage [06:25:29] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage [06:28:09] jouncebot: nowandnext [06:28:09] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600) [06:28:09] In 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700) [06:28:28] is anyone using the mw infra window? can I run some maintenance scripts? [06:33:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:34:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:36:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:15] jouncebot: nowandnext [06:36:15] For the next 0 hour(s) and 23 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600) [06:36:15] In 0 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700) [06:36:21] OK, deploying a fun patch. [06:37:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948080 (owner: 10Giuseppe Lavagetto) [06:38:56] (03Merged) 10jenkins-bot: wikifunctions: add / to the route for wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948080 (owner: 10Giuseppe Lavagetto) [06:39:18] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:29] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]] [06:41:05] !log jforrester@deploy1002 oblivian and jforrester: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [06:41:25] !log jforrester@deploy1002 oblivian and jforrester: Continuing with sync [06:46:25] (03CR) 10Stevemunene: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [06:46:38] !log jmm@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [06:47:04] (03PS2) 10Jforrester: Enable url shortener in sidebar in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [06:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:47:53] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:948080|wikifunctions: add / to the route for wikifunctions]] (duration: 08m 24s) [06:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:56:22] (03PS1) 10Zabe: Some initial configurations for blkwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) [06:59:25] jouncebot: nowandnext [06:59:25] For the next 0 hour(s) and 0 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0600) [06:59:25] In 0 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700) [07:00:02] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T0700). 
Please do the needful. [07:00:04] Bas_dehaan, Daniuu, and Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:12] Confirming that I'm present :) [07:00:27] Present as well [07:00:40] Also present [07:00:51] * taavi looks [07:02:22] \o [07:03:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [07:03:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3006.esams.wmnet with OS bullseye [07:03:26] Bas_dehaan: Daniuu: your patch needs a manual rebase [07:03:45] Bas_dehaan: can you take care of that? [07:03:55] I’ll take a look [07:03:57] * urbanecm is currently calling-in to Wikimania, not positioned to deploy. [07:04:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3008.esams.wmnet with OS bullseye [07:04:15] urbanecm: are you here on-site? [07:04:54] taavi: nope, didn't pass staff selection. calling in from home. 
[07:04:57] (03CR) 10Majavah: [C: 04-2] "This needs an on-wiki discussion, and a much stronger justification, due to the relatively small size of those groups (and the large overl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [07:05:06] ah :( [07:06:03] (03PS1) 10Dreamy Jazz: clienthints: Collect client hints on group1 wikis except two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110) [07:08:12] (03CR) 10Acamicamacaraca: Add "editautopatrolprotected", "rollback" and "patrol" protection levels on shwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: 10Acamicamacaraca) [07:08:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [07:08:35] starting with Dreamy_Jazz's one [07:08:49] Thanks! 
[07:09:07] (03Merged) 10jenkins-bot: clienthints: Collect client hints on group1 wikis except two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949405 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [07:09:10] taavi: :( [07:09:35] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]] [07:09:40] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [07:11:21] !log taavi@deploy1002 taavi and dreamyjazz: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:14:33] !log taavi@deploy1002 taavi and dreamyjazz: Continuing with sync [07:18:12] Daniuu: Bas_dehaan: any news on the rebase? [07:18:24] Working on it :) [07:19:03] IDE had an update overnight and my git config got lost :( [07:19:07] But fixing it [07:21:13] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949405|clienthints: Collect client hints on group1 wikis except two wikis (T341110)]] (duration: 11m 38s) [07:21:17] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [07:21:30] zabe: are you around for deploying your patch? 
[07:21:34] o/ [07:22:25] (03PS2) 10Majavah: Some initial configurations for blkwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: 10Zabe) [07:22:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: 10Zabe) [07:23:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage [07:23:39] (03Merged) 10jenkins-bot: Some initial configurations for blkwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949404 (https://phabricator.wikimedia.org/T344310) (owner: 10Zabe) [07:24:06] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]] [07:24:09] T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310 [07:25:37] (03Abandoned) 10Stevemunene: Add datahub_staging cname [dns] - 10https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236) (owner: 10Stevemunene) [07:25:44] !log taavi@deploy1002 taavi and zabe: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:26:18] looks good [07:26:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3008.esams.wmnet with reason: host reimage [07:26:49] !log taavi@deploy1002 taavi and zabe: Continuing with sync [07:30:32] 10SRE-OnFire, 10Incident Tooling, 10User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (10Joe) [07:30:54] (03PS6) 10Bas dehaan: Added extended confirmed on nlwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) [07:31:22] Just rebased [07:31:55] courtesy ping taavi :) [07:32:07] Thanks, Bas [07:32:19] perfect, will deploy that once this one finishes [07:33:24] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949404|Some initial configurations for blkwiktionary (T344310)]] (duration: 09m 17s) [07:33:30] T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310 [07:33:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [07:34:06] taavi: Thanks :) [07:34:21] (03Merged) 10jenkins-bot: Added extended confirmed on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [07:34:52] !log taavi@deploy1002 Started scap: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]] [07:34:56] T329642: Implementing extended confirmed on nlwiki - https://phabricator.wikimedia.org/T329642 [07:36:32] !log taavi@deploy1002 bmdehaan and taavi: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:36:55] please test [07:37:12] taavi: doing [07:38:14] 10SRE-OnFire, 10Incident Tooling, 10User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (10Joe) p:05Triage→03Medium [07:41:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:42:20] RECOVERY 
- Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:42:54] Daniuu: how is it going? [07:45:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [07:46:23] Just tested on my nlwiki notepad; works as intended [07:46:28] !log taavi@deploy1002 bmdehaan and taavi: Continuing with sync [07:48:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [07:48:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3008.esams.wmnet with OS bullseye [07:52:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [07:53:07] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888736|Added extended confirmed on nlwiki (T329642)]] (duration: 18m 15s) [07:53:11] T329642: Implementing extended confirmed on nlwiki - https://phabricator.wikimedia.org/T329642 [07:54:25] Seems to work as expected on production environment [08:03:53] !log mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php techconductwiki # T242031 [08:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:57] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [08:07:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti3006.esams.wmnet [08:11:39] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:14:24] 10SRE-OnFire, 10Incident Tooling, 10Patch-For-Review, 10User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (10CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/sre/vopsbot/-/merge_requests/11 Allow users to be part of multiple teams [08:16:13] (03PS1) 10Muehlenhoff: confd: Explicitly require directory for systemd cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/949496 [08:16:39] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:26:13] (03PS5) 10Jelto: gitlab: remove cas support [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) [08:27:18] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:46] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42898/console" [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:36:20] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove cas support [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:38:34] (KubernetesAPILatency) resolved: High 
Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:38:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [08:40:39] (03CR) 10David Caro: [C: 03+2] toolforge: add deployer module with the secrets [puppet] - 10https://gerrit.wikimedia.org/r/948566 (https://phabricator.wikimedia.org/T334585) (owner: 10David Caro) [08:42:29] (03PS1) 10D3r1ck01: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) [08:47:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [08:48:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [08:48:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1099.eqiad.wmnet with OS bullseye [08:50:42] (03CR) 10Vgutierrez: [C: 04-1] "we have several issues here, some related to the CR itself, some to acme-chief and some to our CI environment:" [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [08:52:03] (03CR) 10Jaime Nuche: "I'm not very clear about the details, but if we merge this change, won't T344238 still happen again eventually but with more stuck ssh con" [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [08:53:25] (03PS1) 10Urbanecm: mw-debug-repl: Add --verbose [puppet] - 10https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323) [08:53:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 
10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [08:55:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:55:27] (03PS2) 10Urbanecm: mw-debug-repl: Add --verbose [puppet] - 10https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323) [08:55:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM overall, just don't change limits for mw-debug." [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [08:56:22] (03PS2) 10Jforrester: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: 10D3r1ck01) [08:56:54] (03CR) 10Jelto: "Yes" [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [08:59:36] (03CR) 10Jaime Nuche: [C: 03+1] gerrit: raise maxConnectionsPerUser to 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [09:00:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-debug-repl: Add --verbose [puppet] - 10https://gerrit.wikimedia.org/r/949499 (https://phabricator.wikimedia.org/T344323) (owner: 10Urbanecm) [09:00:19] (03PS1) 10Muehlenhoff: Apply ganeti role to ganeti3006/3008 [puppet] - 10https://gerrit.wikimedia.org/r/949500 [09:04:21] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1099.eqiad.wmnet with reason: host reimage [09:04:56] (03CR) 10Muehlenhoff: [C: 03+2] Apply ganeti role to ganeti3006/3008 [puppet] - 
10https://gerrit.wikimedia.org/r/949500 (owner: 10Muehlenhoff) [09:07:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1099.eqiad.wmnet with reason: host reimage [09:08:53] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10jijiki) @Jhancock.wm thank you very much for the update! [09:10:05] 10SRE-tools, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (10ayounsi) [09:10:46] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (10ayounsi) [09:10:50] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10ayounsi) [09:11:07] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (10ayounsi) [09:11:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [09:11:27] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (10ayounsi) [09:12:52] (03CR) 10Jbond: [C: 03+1] "lgtm some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [09:14:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:17:44] RECOVERY - Host cr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.69 ms [09:17:45] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 79.49 ms [09:18:00] RECOVERY - Check systemd state on mw2330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:16] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 58, down: 28, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:18:46] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:19:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:48] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:07] (03CR) 10David Caro: [V: 03+1 C: 03+2] role::wmcs::monitoring: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949037 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [09:27:43] 10SRE-tools, 10Spicerack: Junos module in Spicerack - https://phabricator.wikimedia.org/T344326 (10ayounsi) p:05Triage→03Low [09:29:09] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1099.eqiad.wmnet with OS bullseye [09:31:27] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:32:04] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:47] <_joe_> jouncebot: now and next [09:32:47] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [09:32:54] <_joe_> jouncebot: nowandnext [09:32:54] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [09:32:54] In 0 hour(s) and 27 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1000) [09:33:09] <_joe_> ok, I'll go on and merge the change to fix wikifunctions caching [09:34:39] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change reverse for IPs on cr2-esams that had old cr3-knams in dns names - cmooney@cumin1001" [09:35:10] _joe_: Already deployed, sorry! [09:35:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change reverse for IPs on cr2-esams that had old cr3-knams in dns names - cmooney@cumin1001" [09:35:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:35:40] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references [09:35:57] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 00m 16s) [09:37:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:22] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:10] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references [09:41:27] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 00m 17s) [09:42:16] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references [09:42:25] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [09:43:00] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:43:00] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:43:22] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (eqiad): Cleanup stale config references (duration: 01m 06s) [09:44:18] PROBLEM - PyBal backends health check on lvs3009 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_80: Servers cp3063.esams.wmnet, cp3055.esams.wmnet, cp3061.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:44:39] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:44:49] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms [09:45:02] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s) [09:45:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [09:45:32] RECOVERY - PyBal backends health check on lvs3009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:46:04] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3076 is OK: HTTP OK: HTTP/1.1 200 OK - 431 bytes 
in 0.160 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:46:04] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3079 is OK: HTTP OK: HTTP/1.1 200 OK - 431 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:49:18] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:50:37] (03CR) 10Effie Mouzeli: [C: 03+1] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos) [09:51:00] (03PS17) 10David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:51:42] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10Jelto) [09:51:50] (03CR) 10Effie Mouzeli: [C: 03+2] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos) [09:52:29] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) 05Open→03Resolved Cleanup of puppet code is done and most cas references are removed. I'm not sure how to move forward... 
[09:52:41] (03CR) 10CI reject: [V: 04-1] Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos) [09:52:43] (03CR) 10CI reject: [V: 04-1] Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (owner: 10Jgiannelos) [09:54:12] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 04m 53s) [09:54:31] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:54:54] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s) [09:56:08] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:56:25] (03PS1) 10Filippo Giunchedi: istio: clarify instructions to get the istio version [deployment-charts] - 10https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253) [09:56:32] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s) [09:56:53] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:57:17] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s) [09:57:34] !log jiji@deploy1002 Started deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references [09:57:58] !log jiji@deploy1002 Finished deploy [kartotherian/deploy@3325683] (codfw): Cleanup stale config references (duration: 00m 23s) [09:58:20] (03PS3) 10Jgiannelos: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 [09:58:43] (03Abandoned) 10Jgiannelos: Force image rebuild [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948609 (owner: 10Jgiannelos) [10:00:05] Deploy window 
MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1000) [10:00:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:00:10] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:57] (03PS1) 10Gehel: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 [10:01:38] (03CR) 10CI reject: [V: 04-1] [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (owner: 10Gehel) [10:02:09] Congratulations @zabe :) [10:02:29] +1, very well-deserved zabe! [10:02:30] (03PS1) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179 [10:03:05] (03PS2) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179 [10:03:25] grats! 
[10:06:38] 10SRE, 10Puppet-Core: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Clement_Goubert) [10:08:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:09:09] (03PS8) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [10:09:43] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [10:10:14] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Clement_Goubert) Re-adding #SRE and #Infrastructure-Foundations since this is cross-SRE work under IF stewardship. 
[10:10:31] (03PS9) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [10:11:07] (03CR) 10CI reject: [V: 04-1] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [10:11:29] (03PS1) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) [10:12:14] (03CR) 10Jbond: puppetserver: Add support for defining additional mount points (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [10:14:39] (03CR) 10Filippo Giunchedi: "I'm following this: https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [10:16:24] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:18:17] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Clement_Goubert) [10:19:14] (03CR) 10Jbond: [C: 03+1] "lgtm, ok lets just go with this 😊" [puppet] - 10https://gerrit.wikimedia.org/r/949112 (https://phabricator.wikimedia.org/T344291) (owner: 10Bking) [10:19:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:25] (03PS4) 10Effie Mouzeli: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos) [10:20:51] (03CR) 10Effie Mouzeli: Dependencies maintenance 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos) [10:21:04] 10SRE, 10Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) [10:21:37] (03Merged) 10jenkins-bot: Dependencies maintenance [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/948613 (https://phabricator.wikimedia.org/T344324) (owner: 10Jgiannelos) [10:23:47] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Swap esams ganeti0[1|2] cluster IPs due to subnet/rack mis-allocation - cmooney@cumin1001" [10:24:17] (03PS1) 10Vgutierrez: tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330) [10:24:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Swap esams ganeti0[1|2] cluster IPs due to subnet/rack mis-allocation - cmooney@cumin1001" [10:24:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:24:56] (03PS1) 10Jbond: confd: only rune cleanup command if directory exists [puppet] - 10https://gerrit.wikimedia.org/r/949506 [10:25:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [10:25:16] (03CR) 10Jbond: "this lgtm however i think there is still some race conditions." 
[puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [10:29:58] 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) [10:31:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:10] 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) p:05Triage→03Medium [10:36:59] (03CR) 10Vgutierrez: [C: 04-1] Release 0.36-2 for Bookworm (031 comment) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [10:39:27] !log restarting haproxy service on all knams cp hosts to silence alerts [10:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:44] (03CR) 10Muehlenhoff: confd: Explicitly require directory for systemd cleanup timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [10:45:00] (NodeTextfileStale) firing: (2) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:45:18] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:19] !log depooling maps on codfw - T344110 [10:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:22] T344110: maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 [10:45:45] 10ops-codfw, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10jijiki) [10:48:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:13] 10SRE, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) @hashar could you clarify if T342346 would trigger having python 3.11 on CI with some kind of backport for bullseye or do you have another task tracking python 3.11 suppo... [10:50:00] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:51:31] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [10:52:42] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) @NHillard-WMF This request requires your manager's approval, @SCherukuwada if my information is up to date, as well as approval from @odimitrijevic or @... 
[10:55:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) a:03NHillard-WMF [11:06:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Clement_Goubert) 05Open→03Stalled [11:06:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) 05Open→03Stalled [11:07:06] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10Clement_Goubert) 05Open→03Stalled [11:07:30] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Clement_Goubert) 05Open→03Stalled [11:08:06] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:12:05] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL [11:12:08] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) [11:13:50] !log imported 0.9.0-3~wmf12u1 for bookworm-wikimedia and 0.9.0-3~wmf11u1 for bullseye-wikimedia T340045 [11:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:54] T340045: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 [11:15:26] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and 
dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10MoritzMuehlenhoff) I've uploaded dictdiffer for Bulleye and Bookworm (since we're likely about to move the Cumin servers to Bookworm in the not too distant future) to... [11:15:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10SCherukuwada) Manager here: I approve. [11:16:30] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) [11:19:48] jouncebot: nowandnext [11:19:48] No deployments scheduled for the next 1 hour(s) and 40 minute(s) [11:19:48] In 1 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300) [11:20:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL [11:20:51] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) Hi, While we add your user to the base group, can you make sure you have: - Read the [[ https://wikitech.wikimedia.org/wiki/Analytic... [11:32:06] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) We would also need clarification on whether this request is also for SSH access or only via superset. [11:36:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Clement_Goubert) [11:40:11] (03CR) 10Jbond: [C: 04-1] "-1 see inline. 
this will also need approval fone of the wmde engineering managers[1]" [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:41:12] (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [11:44:25] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [11:45:22] (03CR) 10Jbond: "the problem this cr tries to fix is better fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/949496. As mentioned on that tas" [puppet] - 10https://gerrit.wikimedia.org/r/949506 (owner: 10Jbond) [11:45:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti3006.esams.wmnet [11:45:30] (03Abandoned) 10Jbond: confd: only rune cleanup command if directory exists [puppet] - 10https://gerrit.wikimedia.org/r/949506 (owner: 10Jbond) [11:46:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:48:50] (03CR) 10JMeybohm: [C: 03+1] istio: clarify instructions to get the istio version [deployment-charts] - 10https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [11:49:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 2 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:49:28] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 78.10 ms [11:51:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:52:25] (03PS6) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [11:52:32] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:53:14] PROBLEM - HTTPS Ganeti RAPI esams on ganeti3006 is CRITICAL: connect to address ganeti02.svc.esams.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:54:33] (03PS7) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) [11:54:35] (03PS3) 10JMeybohm: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) [11:54:41] (03CR) 10JMeybohm: Remove limits in ResourceQuota and container limitanges for mediawiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:55:23] (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034) [11:55:38] (03PS1) 10Urbanecm: 
revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034) [11:55:44] jouncebot: nowandnext [11:55:44] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [11:55:44] In 1 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300) [11:56:01] (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [11:56:07] (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:03:06] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:08:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [12:16:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1100.eqiad.wmnet with OS bullseye [12:17:19] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate tasks older than [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949180 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:17:21] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate tasks older than 
[extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949181 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:18:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949180|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]], [[gerrit:949181|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]] [12:18:13] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [12:19:56] RECOVERY - HTTPS Ganeti RAPI esams on ganeti3006 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [12:20:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:20:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:20:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [12:22:38] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2023-08-19 04:23:22 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:52] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [12:24:14] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2023-10-18 03:52:32 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:19] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL [12:26:30] !log 
urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949180|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]], [[gerrit:949181|revalidateLinkRecommendations: Make it possible to revalidate tasks older than (T344034)]] (duration: 08m 20s) [12:26:33] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [12:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:26:42] * urbanecm done [12:28:02] PROBLEM - Host ganeti3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL [12:29:13] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3007.mgmt.esams.wmnet with reboot policy GRACEFUL [12:29:49] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host ganeti3008.mgmt.esams.wmnet with reboot policy GRACEFUL [12:30:39] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1100.eqiad.wmnet with reason: host reimage [12:31:15] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:41] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:11] PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100% [12:33:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on an-worker1100.eqiad.wmnet with reason: host reimage [12:34:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3005.mgmt.esams.wmnet with reboot policy GRACEFUL [12:34:13] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) [12:34:30] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [12:35:16] (03Merged) 10jenkins-bot: tegola-vector-tiles: update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949507 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [12:36:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3006.mgmt.esams.wmnet with reboot policy GRACEFUL [12:36:32] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [12:37:05] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.50 ms [12:37:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3007.mgmt.esams.wmnet with reboot policy GRACEFUL [12:37:09] (03PS1) 10Urbanecm: Growth: Temporarily disable link-recommendation frontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) [12:37:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti3008.mgmt.esams.wmnet with reboot policy GRACEFUL [12:37:24] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [12:37:32] jouncebot: nowandnext [12:37:32] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [12:37:32] In 0 hour(s) and 22 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300) [12:37:48] (03CR) 10Urbanecm: [C: 03+2] "to be able to run revalidateLinkRecommendations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:37:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:37:57] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 81.30 ms [12:39:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:39:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:42:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [12:44:36] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179 (owner: 10Jbond) [12:46:04] (03Merged) 10jenkins-bot: Growth: Temporarily disable link-recommendation frontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949510 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [12:46:30] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949510|Growth: Temporarily disable link-recommendation frontend (T344034)]] [12:46:34] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [12:48:39] (03CR) 10Filippo Giunchedi: [C: 03+2] istio: clarify instructions to get the istio version [deployment-charts] - 10https://gerrit.wikimedia.org/r/949501 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [12:48:43] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 
0%, RTA = 80.57 ms [12:48:57] 10SRE, 10Traffic, 10observability, 10Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10RhinosF1) This happened again for lists1001. Requested (and it has been) restart in #wikimedia-sre [12:49:15] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.07 ms [12:49:42] (03Merged) 10jenkins-bot: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179 (owner: 10Jbond) [12:50:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [12:51:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:51:25] RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:31] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:52:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:53:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:53:40] (03PS1) 10Anzx: Add blkwiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) [12:53:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:54:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:54:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949510|Growth: Temporarily disable link-recommendation frontend (T344034)]] (duration: 08m 04s) [12:54:38] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [12:55:05] dcaro: I86bdfba3e broke puppet on a bunch of hosts, including prometheus [12:55:17] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while e match for Wmflib::Ensure = Enum['absent', 'present'], got 'file' (file: /etc/puppet/modules/profile/functions/pki/get_c [12:55:33] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) [12:58:02] !log mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=ruwiki --olderThan=1651960800 --verbose # T344034 [12:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:06] (03PS1) 10Anzx: add suwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) [12:59:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:59:30] (03CR) 10Jbond: [C: 03+2] service::catalog: Add config-master to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/948560 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [12:59:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [12:59:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE replicasets) on 
k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:59:47] dcaro: going to revert for now since I'm guessing you are at lunch [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1300). [13:00:05] sergi0, xSavitar, Krinkle, MichaelG_WMDE, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:25] hello [13:00:36] o/ [13:01:12] (03PS1) 10Filippo Giunchedi: Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - 10https://gerrit.wikimedia.org/r/949182 [13:01:14] (03CR) 10Gehel: query_service: let puppet manage whitelist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949101 (https://phabricator.wikimedia.org/T343856) (owner: 10Bking) [13:02:24] Krinkle: do you want to start deploying your patch? 
:) [13:02:24] * MichaelG_WMDE is here [13:03:12] (03PS1) 10Jbond: service::catalog: correct discovery value [puppet] - 10https://gerrit.wikimedia.org/r/949514 (https://phabricator.wikimedia.org/T341717) [13:03:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] service::catalog: correct discovery value [puppet] - 10https://gerrit.wikimedia.org/r/949514 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:03:46] urbanecm: go ahead with yours if you like [13:04:04] Ill be fully available in a few :) [13:04:06] (03CR) 10CI reject: [V: 04-1] Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - 10https://gerrit.wikimedia.org/r/949182 (owner: 10Filippo Giunchedi) [13:05:11] (03PS2) 10Filippo Giunchedi: Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - 10https://gerrit.wikimedia.org/r/949182 [13:06:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:12] (03PS2) 10Jbond: config-master: add new discovery record for config-master [dns] - 10https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) [13:06:42] (03CR) 10Jbond: [C: 03+2] config-master: add new discovery record for config-master [dns] - 10https://gerrit.wikimedia.org/r/948562 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:07:24] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [13:07:36] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "role::wmcs::monitoring: pass through the ensure option" [puppet] - 10https://gerrit.wikimedia.org/r/949182 (owner: 10Filippo Giunchedi) [13:10:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:23] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [13:14:09] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10MareikeHeuerWMDE) [13:14:49] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [13:17:40] (03CR) 10Ssingh: [C: 03+1] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/949179 (owner: 10Jbond) [13:17:45] (03Merged) 10jenkins-bot: GrowthExperiments: enable add a link in 11th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948094 (https://phabricator.wikimedia.org/T308136) (owner: 10Sergio Gimeno) [13:18:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]] [13:18:23] T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 [13:19:54] !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:17] testing now [13:20:26] thanks 
[13:21:03] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage - https://phabricator.wikimedia.org/T342247 (10Gehel) 05Open→03Resolved a:03Gehel [13:21:44] (03PS1) 10Jbond: trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) [13:22:18] (03CR) 10CI reject: [V: 04-1] trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:22:20] (03PS1) 10Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) [13:22:47] I tested 4 wikis, things looking fine on my end [13:23:16] !log urbanecm@deploy1002 sgimeno and urbanecm: Continuing with sync [13:23:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42901/console" [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:24:26] 10SRE-tools, 10Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (10ayounsi) [13:25:40] (03PS2) 10Jbond: trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) [13:25:56] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10Clement_Goubert) 05Open→03In progress Hi, In order to process your access request, I'm going to need @KFrancis to process your NDA email: mareike.heuer@wikimedia.de as well... 
[13:26:01] <_joe_> !incidents [13:26:01] 3949 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [13:26:14] 10SRE-tools, 10Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (10cmooney) ganeti* and cloudvirt* for sure it'd make sense to have this for [13:26:23] (03PS3) 10Urbanecm: testwikidatawiki: always show MUL in Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: 10Michael Große) [13:26:43] (03CR) 10Urbanecm: [C: 03+2] jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: 10D3r1ck01) [13:26:50] (03CR) 10Urbanecm: [C: 03+2] testwikidatawiki: always show MUL in Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: 10Michael Große) [13:27:19] * MichaelG_WMDE is ready to test whenever you are :) [13:27:32] (03Merged) 10jenkins-bot: testwikidatawiki: always show MUL in Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949030 (https://phabricator.wikimedia.org/T343409) (owner: 10Michael Große) [13:28:03] MichaelG_WMDE: will ping you :) [13:28:13] 👍 [13:28:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [13:29:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3008.esams.wmnet to cluster esams02 and group BW27 [13:29:48] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:948094|GrowthExperiments: enable add a link in 11th round of wikis (T308136)]] (duration: 11m 32s) [13:29:52] T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 [13:30:12] sergi0: 
should be live :) [13:30:37] urbanecm: 🎉 ty! [13:30:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]] [13:30:41] np [13:30:46] T343409: MUL - Configure Test Wikidata to full-rollout mode - https://phabricator.wikimedia.org/T343409 [13:32:22] !log urbanecm@deploy1002 migr and urbanecm: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:32:39] MichaelG_WMDE: please test! [13:33:00] * MichaelG_WMDE tests [13:33:41] urbanecm: It works, thank you! [13:33:46] great, proceeding [13:33:50] !log urbanecm@deploy1002 migr and urbanecm: Continuing with sync [13:35:27] godog: thanks, doctor appointment, weird :/, pcc was clear iirc, I'll recheck when I'm back [13:36:12] (03PS2) 10Urbanecm: Add blkwiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: 10Anzx) [13:36:17] (03PS2) 10Urbanecm: add suwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [13:36:21] (03CR) 10Urbanecm: [C: 03+2] Add blkwiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: 10Anzx) [13:36:23] dcaro: sure take your time, all is well after the revert, I don't think PCC can catch these failures [13:36:23] (03PS1) 10Muehlenhoff: Add ganeti02 cluster for esams [puppet] - 10https://gerrit.wikimedia.org/r/949517 [13:36:25] (03CR) 10Urbanecm: [C: 03+2] add suwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [13:36:33] (03CR) 10Nskaggs: [C: 03+2] Add dr0ptp4kt to wmcs-admin [puppet] 
- 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt) [13:37:19] (03Merged) 10jenkins-bot: Add blkwiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949512 (https://phabricator.wikimedia.org/T344310) (owner: 10Anzx) [13:37:26] (03Merged) 10jenkins-bot: add suwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949513 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [13:38:08] (03CR) 10Ayounsi: [C: 03+1] Add ganeti02 cluster for esams [puppet] - 10https://gerrit.wikimedia.org/r/949517 (owner: 10Muehlenhoff) [13:40:16] (03PS2) 10Urbanecm: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) [13:40:19] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [13:40:23] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949030|testwikidatawiki: always show MUL in Termbox (T343409)]] (duration: 09m 43s) [13:40:29] T343409: MUL - Configure Test Wikidata to full-rollout mode - https://phabricator.wikimedia.org/T343409 [13:41:01] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949512|Add blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]] [13:41:06] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [13:41:06] T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310 [13:41:18] aanzx: your patch's next [13:41:23] Ok [13:42:38] (03PS1) 10Btullis: Remove the override for rocm version for buster hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/949518 (https://phabricator.wikimedia.org/T332570) [13:42:40] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949512|Add 
blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:42:50] Testing [13:43:23] thanks [13:43:39] I now also see my changes on the live servers. Thanks! 🎉 [13:43:40] (03Merged) 10jenkins-bot: jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949178 (https://phabricator.wikimedia.org/T344223) (owner: 10D3r1ck01) [13:43:43] MichaelG_WMDE: awesome [13:43:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1101.eqiad.wmnet with OS bullseye [13:44:07] (03CR) 10Btullis: [C: 03+2] Remove the override for rocm version for buster hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/949518 (https://phabricator.wikimedia.org/T332570) (owner: 10Btullis) [13:44:12] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti02 cluster for esams [puppet] - 10https://gerrit.wikimedia.org/r/949517 (owner: 10Muehlenhoff) [13:44:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10joanna_borun) Approved [13:44:21] (03CR) 10Ssingh: [C: 03+2] devices: add anycast_ and lvs_neigbhors for esams (bw27/by27) [homer/public] - 10https://gerrit.wikimedia.org/r/949100 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:44:23] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:52] urbanecm: both logos are good [13:44:57] great, proceeding [13:44:58] !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync [13:46:38] urbanecm: there is one patch i have added to calendar would it be merged now , or I should 
reschedule it for later [13:46:44] !log running homer on asw1-b*27-esams* for CR 949100: T329219 [13:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:48] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [13:47:02] aanzx: we're running very short on time, please reschedule for later. [13:47:19] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - 10https://gerrit.wikimedia.org/r/949519 [13:47:20] Ok thanks, will reschedule it [13:48:25] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [13:48:29] (03CR) 10Ayounsi: [C: 03+1] sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - 10https://gerrit.wikimedia.org/r/949519 (owner: 10Muehlenhoff) [13:49:04] (03Merged) 10jenkins-bot: Growth: Enable new Impact backend on large Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949033 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [13:50:20] (03CR) 10Ssingh: [C: 03+2] esams/ntp: point to dns3003 [dns] - 10https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:50:24] (03PS2) 10Ssingh: esams/ntp: point to dns3003 [dns] - 10https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) [13:50:46] (03PS1) 10Jbond: config_master: remove_default_ports and add modules [puppet] - 10https://gerrit.wikimedia.org/r/949521 (https://phabricator.wikimedia.org/T341717) [13:51:05] (03PS1) 10Muehlenhoff: Fix profile title [puppet] - 10https://gerrit.wikimedia.org/r/949522 [13:51:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949512|Add blkwiktionary logo (T344310)]], [[gerrit:949513|add suwikisource logo (T344314)]] (duration: 10m 37s) [13:51:43] (03CR) 10Ssingh: 
[V: 03+2] esams/ntp: point to dns3003 [dns] - 10https://gerrit.wikimedia.org/r/949113 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:51:43] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [13:51:44] T344310: Initial configurations for blkwiktionary - https://phabricator.wikimedia.org/T344310 [13:51:57] aanzx: should be live. [13:51:59] !log running authdns-update for CR 949113 [13:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:19] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] [13:52:20] Krinkle: xSavitar: starting scap for your backport now. 
[13:52:25] ack [13:52:28] T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143 [13:52:28] T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223 [13:52:28] T343291: [betacluster] Cannot login - UserLogin RuntimeException: Failed to run getConfiguration.php - https://phabricator.wikimedia.org/T343291 [13:52:31] !log restart pybal on lvs3008 [13:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] (03CR) 10Ayounsi: [C: 03+1] Fix profile title [puppet] - 10https://gerrit.wikimedia.org/r/949522 (owner: 10Muehlenhoff) [13:52:47] urbanecm: it is live , thanks [13:53:22] np [13:53:55] !log urbanecm@deploy1002 urbanecm and d3r1ck01: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:54:16] Krinkle: can you test? [13:56:04] (03CR) 10JHathaway: puppetserver: Add support for defining additional mount points (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:56:11] testing.. 
[13:56:16] (03CR) 10Muehlenhoff: [C: 03+2] Fix profile title [puppet] - 10https://gerrit.wikimedia.org/r/949522 (owner: 10Muehlenhoff) [13:56:39] (03PS1) 10Ssingh: common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) [13:56:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:14] (03CR) 10CI reject: [V: 04-1] common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:57:51] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.makevm: Add esams to list of DC with per-rack VLANs [cookbooks] - 10https://gerrit.wikimedia.org/r/949519 (owner: 10Muehlenhoff) [13:58:41] (03PS2) 10Ssingh: common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) [13:59:53] urbanecm: LGTM [13:59:59] ty, proceeding [14:00:02] !log urbanecm@deploy1002 urbanecm and d3r1ck01: Continuing with sync [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1400) [14:00:20] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1101.eqiad.wmnet with reason: host reimage [14:00:53] actually... Krinkle: i just saw `Object of class MediaWiki\JobQueue\JobQueueGroupFactory could not be converted to string` in logs. sounds like an issue to me? [14:01:17] urbanecm: that's me messing about on eval.php [14:01:19] ah [14:01:45] missed that detail. 
continuing :) [14:03:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1101.eqiad.wmnet with reason: host reimage [14:04:05] (03PS1) 10Jbond: release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 [14:04:49] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10Tobi_WMDE_SW) >>! In T344341#9096283, @Clement_Goubert wrote: > Hi, > > In order to process your access request, I'm going to need @KFrancis to process your NDA (email: mareike... [14:05:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow3003.esams.wmnet [14:05:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:05:05] (03CR) 10Jbond: [C: 03+2] config_master: remove_default_ports and add modules [puppet] - 10https://gerrit.wikimedia.org/r/949521 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [14:06:32] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] (duration: 14m 13s) [14:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:39] T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143 [14:06:39] T344223: User logging in on mw-on-k8s triggers "RuntimeException: firejail is enabled, but cannot be found" - https://phabricator.wikimedia.org/T344223 [14:06:40] T343291: [betacluster] Cannot login - UserLogin RuntimeException: Failed to run getConfiguration.php - 
https://phabricator.wikimedia.org/T343291 [14:06:46] should all be live. [14:07:05] (03PS1) 10Urbanecm: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) [14:07:14] (03PS2) 10Urbanecm: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) [14:07:57] (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [14:08:37] (03Merged) 10jenkins-bot: Revert "Growth: Temporarily disable link-recommendation frontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949184 (https://phabricator.wikimedia.org/T344034) (owner: 10Urbanecm) [14:09:09] (03PS1) 10Muehlenhoff: Remove bastion role from bast3006 (will be replaced by bast3007) [puppet] - 10https://gerrit.wikimedia.org/r/949528 [14:09:22] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3003.esams.wmnet - jmm@cumin2002" [14:10:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3003.esams.wmnet - jmm@cumin2002" [14:10:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:10] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow3003.esams.wmnet on all recursors [14:10:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow3003.esams.wmnet on all recursors [14:10:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:10:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949184|Revert "Growth: Temporarily 
disable link-recommendation frontend" (T344034)]] [14:10:44] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [14:11:11] (03CR) 10Muehlenhoff: [C: 03+2] Remove bastion role from bast3006 (will be replaced by bast3007) [puppet] - 10https://gerrit.wikimedia.org/r/949528 (owner: 10Muehlenhoff) [14:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:12:27] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:949184|Revert "Growth: Temporarily disable link-recommendation frontend" (T344034)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:12:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host netflow3003.esams.wmnet [14:13:02] !log urbanecm@deploy1002 urbanecm: Continuing with sync [14:15:30] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:16:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:54] (03PS1) 10Ssingh: hiera: LVS: update tagged_subnets for esams [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) [14:16:57] (03PS1) 10David Caro: p:tlsproxy::envoy: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) [14:17:11] (03CR) 10Thcipriani: release: add additional instructions (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 (owner: 10Jbond) [14:17:45] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42903/console" [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:17:49] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949530 should be the new patch, pcc would have caught it but I only tested with the hosts that have 'ensure' set to false, so the default branch did not get tested [14:18:17] (03CR) 10Ssingh: [V: 03+1] "PCC failing because of missing facts, let me update them." 
[puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:18:50] !log sukhe@puppetmaster1001:~$ sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 [14:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949184|Revert "Growth: Temporarily disable link-recommendation frontend" (T344034)]] (duration: 09m 06s) [14:19:50] T344034: ruwiki: Too many AddLink suggestions were generated before 'excludedSections' rule was introduced - https://phabricator.wikimedia.org/T344034 [14:19:54] done all [14:20:40] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10Clement_Goubert) 05Open→03In progress [14:20:57] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Clement_Goubert) 05Open→03In progress [14:22:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Clement_Goubert) 05Open→03In progress [14:23:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1100.eqiad.wmnet with OS bullseye [14:24:10] (03CR) 10Nskaggs: [C: 03+2] Add Nicholas as approver for wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947319 (owner: 10Muehlenhoff) [14:24:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow3003.esams.wmnet [14:24:34] (KubernetesAPILatency) resolved: (2) High 
Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:24:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:25:04] !log ssh pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor [14:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:48] claime: hi, can i bother you to do `systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited` and `systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered` on mwmaint1002 please? i enabled the jobs on a couple of additional wikis and i'd like to observe how well they cope with the added work. thanks! [14:25:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:53] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow3003.esams.wmnet on all recursors [14:25:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow3003.esams.wmnet on all recursors [14:26:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3003.esams.wmnet - jmm@cumin2002" [14:26:59] urbanecm: sure thing [14:27:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3003.esams.wmnet - jmm@cumin2002" [14:27:13] ty. 
task is T344143, if you need that info :) [14:27:13] T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143 [14:28:02] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,session-c64.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow3003.esams.wmnet with OS bookworm [14:29:18] urbanecm: Launched, I suppose it's normal they're not giving control back, they're supposed to run front? [14:29:19] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344353 (10phaultfinder) [14:30:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1101.eqiad.wmnet with OS bullseye [14:31:01] claime: good question, i'm not privy to systemctl commands, so I'm not sure how they behave. but i see the job's running now. [14:31:36] yep, it's not a problem if they don't run in the background, I launched them in a tmux [14:31:58] are they supposed to be run on a timer usually or something ? [14:33:26] Yeah, according to puppet they're periodic jobs [14:33:50] yeah, it's a timer-based job. i wanted them to run now, so i can better monitor for possible issues, given i enabled them for our biggest wikis today. [14:34:09] urbanecm: ack. mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered just finished running [14:34:18] great, thanks. [14:34:33] And so did mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited [14:35:00] thanks again. logs look good so far. i'll monitor logstash for a bit. 
[14:35:00] (03PS3) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) [14:36:47] (03PS3) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [14:38:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:52] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:41:00] RECOVERY - config-master.wikimedia.org requires authentication on config-master1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:41:14] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:57] (03CR) 10Ayounsi: [C: 03+1] common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [14:42:40] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:49:58] 10SRE, 10ops-knams, 10DC-Ops: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [14:50:00] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:51:06] (03PS1) 10Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) [14:51:13] 10SRE, 10Observability-Alerting: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) While investigating the ns2-v4 not being removed (cc @andrea.denisse @ssingh ) today, this is the log: ` Aug 11 16:28:31 alert1001 puppet-agent[4665]: Applying confi... 
[14:51:41] !log registry* - upgrade jwt-authorizer package on all 4 hosts to version 1.1.1-1 - T337474 [14:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:49] 10SRE, 10ops-knams, 10DC-Ops: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ayounsi) [14:51:54] T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474 [14:52:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow3003.esams.wmnet with reason: host reimage [14:54:13] (03PS1) 10Ayounsi: Update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219) [14:56:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow3003.esams.wmnet with reason: host reimage [14:57:59] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@155299c] (releasing): (no justification provided) [14:58:29] (03PS1) 10Ayounsi: Update esams netflow collector [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219) [14:58:40] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@155299c] (releasing): (no justification provided) (duration: 00m 41s) [15:00:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3006.wikimedia.org [15:00:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1102.eqiad.wmnet with OS bullseye [15:00:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1103.eqiad.wmnet with OS bullseye [15:00:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable 
[15:01:39] (03PS1) 10Jbond: admin: add taavi to ops group [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) [15:01:53] 10SRE, 10Math, 10RESTbase Sunsetting, 10Traffic: Determin the cause of a sudden 80% drop in requests to math endpoints - https://phabricator.wikimedia.org/T344329 (10daniel) [15:02:44] (03PS1) 10Muehlenhoff: Add netflow3003 to Ferm rules for Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/949537 (https://phabricator.wikimedia.org/T344355) [15:02:57] RECOVERY - config-master.wikimedia.org requires authentication on config-master2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:03:06] 10SRE, 10Math, 10RESTbase Sunsetting, 10Traffic: Determin the cause of x8 increase in requests to math endpoints between july 6 and August 3 - https://phabricator.wikimedia.org/T344329 (10daniel) [15:03:11] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos) [15:04:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:04:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [15:04:36] (03CR) 10Jbond: "ready" [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: 10Jbond) [15:04:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [15:05:30] (03Abandoned) 10Muehlenhoff: Add netflow3003 to Ferm rules for Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/949537 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [15:05:49] 
(RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:08:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:09:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3006.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:09:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3006.wikimedia.org [15:10:07] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3006.wikimedia.org` - bast3006.wikimedia.org (**PASS**) - Downt... 
[15:10:28] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping3003.esams.wmnet [15:13:19] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) [15:13:21] (03PS1) 10Muehlenhoff: Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355) [15:14:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow3003.esams.wmnet with OS bookworm [15:14:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow3003.esams.wmnet [15:14:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:14:33] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1103.eqiad.wmnet with reason: host reimage [15:14:55] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1102.eqiad.wmnet with reason: host reimage [15:14:59] (03CR) 10Ayounsi: [C: 03+1] Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [15:15:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:17:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1103.eqiad.wmnet with reason: host reimage [15:18:04] (03PS10) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [15:18:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:18:35] (03CR) 10Ayounsi: "1 comment but lgtm otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:18:46] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [15:18:54] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:18:59] (03CR) 10Jbond: "ready" [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:19:04] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:19:08] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast3006/ping3003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949541 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [15:19:15] (03CR) 10Ssingh: [V: 03+1] hiera: LVS: update tagged_subnets for esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:19:17] (03CR) 10Ayounsi: [C: 03+2] Update netflow collector IP [homer/public] - 10https://gerrit.wikimedia.org/r/949533 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [15:19:39] (03CR) 10Ayounsi: [C: 03+2] Update esams netflow collector [puppet] - 10https://gerrit.wikimedia.org/r/949534 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [15:19:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
an-worker1102.eqiad.wmnet with reason: host reimage [15:20:02] (03CR) 10Ayounsi: [C: 03+1] hiera: LVS: update tagged_subnets for esams [puppet] - 10https://gerrit.wikimedia.org/r/949529 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:20:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:24] (03Merged) 10jenkins-bot: tegola-vector-tiles: update image on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/949540 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:20:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:20:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping3003.esams.wmnet [15:20:41] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping3003.esams.wmnet` - ping3003.esams.wmnet (**PASS**) - Downtimed... 
[15:21:11] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [15:21:31] (03CR) 10Jbond: release: add additional instructions (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 (owner: 10Jbond) [15:21:33] (03PS1) 10Fabfur: hiera: decommission dns3001 and dns3002 [puppet] - 10https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219) [15:21:56] (03PS2) 10Jbond: release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 [15:23:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:24:24] (03CR) 10BCornwall: [C: 03+1] tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330) (owner: 10Vgutierrez) [15:25:27] (03PS1) 10Muehlenhoff: New install server for new esams [puppet] - 10https://gerrit.wikimedia.org/r/949543 (https://phabricator.wikimedia.org/T344355) [15:27:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:18] (03CR) 10Muehlenhoff: [C: 03+2] New install server for new esams [puppet] - 10https://gerrit.wikimedia.org/r/949543 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [15:28:21] 10SRE, 10Traffic: Q1:unified decommission task for old esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (10ssingh) [15:28:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host 
install3003.wikimedia.org [15:28:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:29:16] (03PS1) 10BCornwall: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) [15:29:35] (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:29:39] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10jbond) >>! In T343039#9068357, @Marostegui wrote: > We really need to come up with a way to be able to grant root access to clouddb* hosts that doesn't imply root on al... [15:30:13] (03CR) 10BCornwall: Release 0.36-2 for Bookworm (032 comments) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/948672 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:30:40] (03PS2) 10Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) [15:30:58] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:09] (03PS2) 10Jbond: P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) [15:33:10] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3003.wikimedia.org - jmm@cumin2002" [15:33:16] (03PS2) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) [15:33:57] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3003.wikimedia.org - jmm@cumin2002" [15:33:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:57] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install3003.wikimedia.org on all recursors [15:34:00] (03CR) 10Ssingh: [C: 03+2] common: update ntp_servers with dns300[34] [homer/public] - 10https://gerrit.wikimedia.org/r/949525 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:34:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install3003.wikimedia.org on all recursors [15:34:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3003.wikimedia.org - jmm@cumin2002" [15:35:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3003.wikimedia.org - jmm@cumin2002" [15:35:34] (03CR) 10Jbond: P:puppetserver: add support for extra_mounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:36:02] (03CR) 10JHathaway: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:36:17] (03CR) 10Ayounsi: P:pybal: update bgp-peer-address for asw1-b*27-esams (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:37:18] (03PS1) 10Esanders: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) [15:38:04] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10nskaggs) > @nskaggs As the group owner are you able to approve this request Yes, I approve. [15:38:16] (03PS3) 10Ssingh: P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) [15:38:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install3003.wikimedia.org with OS bullseye [15:39:16] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts netflow3002.esams.wmnet [15:39:17] (03CR) 10Ayounsi: [C: 03+1] P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:39:58] !log homer "mr*" commit "add ntp_servers add dns300[34]" [15:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: LVS: update tagged_subnets for esams [puppet] - 10https://gerrit.wikimedia.org/r/949529 
(https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:41:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1103.eqiad.wmnet with OS bullseye [15:42:08] (03CR) 10Ssingh: [C: 03+2] P:pybal: update bgp-peer-address for asw1-b*27-esams [puppet] - 10https://gerrit.wikimedia.org/r/949531 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:43:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:43:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1102.eqiad.wmnet with OS bullseye [15:44:36] jouncebot: nowandnext [15:44:36] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [15:44:36] In 1 hour(s) and 15 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1700) [15:45:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:45:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:45:51] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42904/console" [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [15:46:38] (JobUnavailable) firing: Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[15:46:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:46:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:46:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow3002.esams.wmnet [15:46:59] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `netflow3002.esams.wmnet` - netflow3002.esams.wmnet (**PASS**) - Dow... [15:47:17] (03CR) 10David Caro: [V: 03+1] "Tested now with one of enabling and one disabling envoy." [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [15:47:18] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [15:48:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:28] !log restart pybal on new lvses in esams [15:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:30] (03PS1) 10BryanDavis: shellbox: Bump to 2023-08-15-040901 [deployment-charts] - 10https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460) [15:51:20] legoktm: Do you have any practical advice for how to test shellbox containers in staging? 
Asking in reference to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/949548/ [15:52:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install3003.wikimedia.org with reason: host reimage [15:52:42] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:32] (03CR) 10Ssingh: [C: 03+1] hiera: decommission dns3001 and dns3002 [puppet] - 10https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219) (owner: 10Fabfur) [15:56:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install3003.wikimedia.org with reason: host reimage [15:56:42] (03PS1) 10BCornwall: Remove esams hosts prior to knams migration [puppet] - 10https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T329219) [15:57:51] (03PS1) 10Muehlenhoff: Make install3003 the new install server for esams [puppet] - 10https://gerrit.wikimedia.org/r/949552 (https://phabricator.wikimedia.org/T344355) [15:58:30] (03CR) 10Fabfur: [C: 03+2] hiera: decommission dns3001 and dns3002 [puppet] - 10https://gerrit.wikimedia.org/r/949542 (https://phabricator.wikimedia.org/T329219) (owner: 10Fabfur) [16:00:23] !log running puppet-agent on A:cumin A:dns-rec A:netbox to remove dns3001 and dns3002 [16:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:19] (03PS11) 10Jbond: puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) [16:02:21] (03PS3) 10Jbond: P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) [16:02:23] (03PS3) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 
(https://phabricator.wikimedia.org/T341056) [16:04:14] (03CR) 10Majavah: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: 10Jbond) [16:05:18] (03PS2) 10BCornwall: Remove esams hosts prior to knams migration [puppet] - 10https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363) [16:05:20] (03CR) 10Jbond: [C: 03+2] admin: add taavi to ops group [puppet] - 10https://gerrit.wikimedia.org/r/949536 (https://phabricator.wikimedia.org/T342307) (owner: 10Jbond) [16:05:55] (03CR) 10Filippo Giunchedi: [C: 03+1] p:tlsproxy::envoy: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [16:05:58] taavi FYI ^^^ is merged, let me know if you want me to run puppet anywhere specific [16:06:15] congratulations taavi! :) [16:06:30] jbond: thank you!! I think I'm fine waiting for puppet to run naturally [16:06:37] ack sgtm [16:06:40] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [16:06:53] can I add myself to ldap/ops or do I need to request that separately? 
[16:07:07] taavi i'll do that as well, one sec [16:07:46] * jbond sees many taavi* users in ldap :) [16:08:10] that might happen if you're trying to debug the authentication system :P [16:08:14] :D [16:08:27] * urbanecm looks at the `MU test *` accounts in SUL [16:08:34] :) ok that's done now as well, welcome and congrats :) [16:11:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install3003.wikimedia.org with OS bullseye [16:11:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install3003.wikimedia.org [16:12:15] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [16:12:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh[3001-3002].wikimedia.org [16:16:39] (JobUnavailable) resolved: Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:52] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:19:14] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts dns3001.wikimedia.org [16:21:10] !log mv /var/lib/puppet/volatile/misc /home/jbond on puppetmaster1001 as it (legacy geoip data) appears unused [16:21:10] 10SRE, 10SRE-Access-Requests: Login rejected on horizon.wikimedia.org - https://phabricator.wikimedia.org/T344367 (10darthmon_wmde) [16:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:24] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh[3001-3002].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:21:46] (JobUnavailable) firing: (2) Reduced availability for job fastnetmon 
in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:22:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh[3001-3002].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:22:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh[3001-3002].wikimedia.org [16:22:37] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `doh[3001-3002].wikimedia.org` - doh3001.wikimedia.org (**PASS**)... [16:23:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum[3001-3002].esams.wmnet [16:23:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1104.eqiad.wmnet with OS bullseye [16:24:02] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [16:24:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1105.eqiad.wmnet with OS bullseye [16:25:07] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:25:08] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dns3001.wikimedia.org [16:25:19] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `dns3001.wikimedia.org` - dns3001.wikimedia.org (**PASS**) - Downti... 
[16:28:03] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:28:20] (03CR) 10Majavah: [C: 03+2] tools-static: Hide more Cloudflare response headers [puppet] - 10https://gerrit.wikimedia.org/r/940506 (owner: 10Lucas Werkmeister) [16:30:23] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum[3001-3002].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:31:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum[3001-3002].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:31:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:31:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum[3001-3002].esams.wmnet [16:31:32] (03PS1) 10Jbond: puppetmaster: stop creating the volatile/misc folder [puppet] - 10https://gerrit.wikimedia.org/r/949554 (https://phabricator.wikimedia.org/T341717) [16:31:35] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `durum[3001-3002].esams.wmnet` - durum3001.esams.wmnet (**PASS**)... 
[16:31:38] !log restarting CI Jenkins to update plugins [16:31:38] (JobUnavailable) firing: (3) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:47] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts dns3002.wikimedia.org [16:33:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:41] !log mv /var/lib/puppet/volatile/squid /home/jbond on puppetmaster1001 as it appears unused [16:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:00] (03CR) 10Ssingh: [C: 03+1] Remove esams hosts prior to knams migration [puppet] - 10https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363) (owner: 10BCornwall) [16:36:40] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [16:37:10] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:15] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ssingh) [16:38:01] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1104.eqiad.wmnet with reason: host reimage [16:38:12] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1105.eqiad.wmnet with reason: host reimage [16:39:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir[3001-3002].esams.wmnet [16:39:24] 10SRE, 
10PyBal, 10Scap, 10Traffic, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (10thcipriani) [16:40:08] (03PS4) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) [16:40:23] (03CR) 10Jbond: puppetserver: add volatile file mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [16:40:23] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ncredir[3001-3002].esams.wmnet [16:40:29] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [16:41:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [16:41:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:22] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns3002.wikimedia.org [16:41:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1104.eqiad.wmnet with reason: host reimage [16:41:38] (JobUnavailable) firing: (4) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:42:05] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - 
https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `dns3002.wikimedia.org` - dns3002.wikimedia.org (**PASS**) - Downti... [16:42:51] (03CR) 10BCornwall: [C: 03+2] Remove esams hosts prior to knams migration [puppet] - 10https://gerrit.wikimedia.org/r/949551 (https://phabricator.wikimedia.org/T344363) (owner: 10BCornwall) [16:43:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1105.eqiad.wmnet with reason: host reimage [16:43:48] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@ec5d4cd]: T342213 [16:43:51] T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 [16:44:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:46] (03PS1) 10Eevans: restbase: move (temporary) per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) [16:44:47] !log btullis@deploy1002 deploy aborted: T342213 (duration: 00m 59s) [16:45:39] (03PS2) 10Eevans: restbase: move (temporary) per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) [16:45:57] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213 [16:46:07] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:21] 10SRE, 10Traffic, 10Patch-For-Review: Q1:unified decommission task for old 
esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (10Fabfur) [16:46:38] (JobUnavailable) resolved: (4) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:46:49] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [16:47:07] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:21] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3050-3053].esams.wmnet [16:47:21] PROBLEM - PyBal IPVS diff check on lvs3009 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:47:39] esams and other related hosts complaining [16:47:45] !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213 (duration: 01m 48s) [16:49:03] jynus: yeah decommissioning in progress [16:49:07] probably should downtime [16:49:18] (03CR) 10Eevans: [C: 03+2] restbase: move (temporary) per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/949556 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [16:49:46] (ConfdResourceFailed) firing: (24) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:50:21] PROBLEM - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:50:39] PROBLEM - PyBal IPVS
diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:50:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:51:13] silencing [16:51:48] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 [16:51:51] T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 [16:52:01] !log restarting ntp service in core sites [16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:52:59] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 [16:54:18] (03PS1) 10Ssingh: ncredir300x: decommission hosts in esams [puppet] - 10https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) [16:54:47] (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:57:00] !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 04m 00s) [16:57:03] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 [16:57:07] (ProbeDown) firing: (14) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:08] T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 [16:57:11] PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:33] (03PS2) 10Ssingh: ncredir300x: decommission hosts in esams [puppet] - 10https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1700) [17:00:13] (03PS3) 10Ssingh: ncredir300x: decommission hosts in esams [puppet] - 10https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) [17:01:52] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [17:02:19] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:04:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1104.eqiad.wmnet with OS bullseye [17:06:08] !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 09m 04s) [17:06:11] T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 [17:06:23] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3050-3053].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:06:52] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [17:06:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
an-worker1105.eqiad.wmnet with OS bullseye [17:07:26] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3050-3053].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:07:26] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3050-3053].esams.wmnet [17:07:39] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3050-3053].esams.wmnet` - cp3050.esams.wmnet (**PASS**) - Downti... [17:09:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:04] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3054-3057].esams.wmnet [17:14:20] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:14:30] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:19:24] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:19:44] ^ these will be resolving soon, space restarts are in progress [17:20:00] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:20:07] (03CR) 10Ssingh: "Not sure if needed but I thought I should check with you before decomm." [puppet] - 10https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [17:22:13] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3054-3057].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:23:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:30] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3054-3057].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:23:30] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3054-3057].esams.wmnet [17:23:42] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3054-3057].esams.wmnet` - cp3054.esams.wmnet (**PASS**) - Downti... 
[17:26:52] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3058-3061].esams.wmnet [17:27:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:50] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:31:02] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10KFrancis) Thank you. The NDA is out for signatures. [17:33:49] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:38:16] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:40:17] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3058-3061].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:40:53] (03PS1) 10Ssingh: lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) [17:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:42:07] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42905/console" [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh) [17:43:10] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3058-3061].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:43:10] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:43:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3058-3061].esams.wmnet [17:43:20] 10SRE, 10Traffic, 10Patch-For-Review: Q1:unified decommission task for old esams hosts (knams migration) - https://phabricator.wikimedia.org/T344363 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[3058-3061].esams.wmnet` - cp3058.esams.wmnet (**PASS**) - D... [17:45:26] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[3062-3065].esams.wmnet [17:45:31] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:45:34] ^ expected [17:46:21] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [17:50:20] (03CR) 10BCornwall: [C: 03+1] lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh) [17:50:27] !log run puppet-agent on A:dns-rec to restart ntp service [17:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:20] !log restart ntp on A:dns-rec and A:edges' [17:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:39] (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs300[5-7]: decommission old esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/949563 (https://phabricator.wikimedia.org/T344363) (owner: 10Ssingh) [17:54:33] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:55:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:56:29] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3062-3065].esams.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:56:38] (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:57:22] expected, decom [17:58:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs[3005-3007].esams.wmnet [17:58:52] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[3062-3065].esams.wmnet 
decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [17:58:52] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:58:53] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[3062-3065].esams.wmnet [18:00:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:05] brennen and dancy: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1800). Please do the needful. [18:00:05] brennen and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T1800). [18:00:11] o/ [18:00:24] Train is unblocked. Pressing the buttons. [18:00:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:01:15] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724) [18:01:17] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [18:02:06] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949565 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [18:03:06] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:07] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:24] uh oh [18:06:36] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:06:38] probably the LVS removal [18:06:58] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Let's add the secret and test tomorrow." [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [18:06:59] but esams is depooled [18:07:01] can someone ACK it? [18:07:06] yeah it was that [18:07:07] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:47] arnoldokoth: thanks for ACK [18:09:27] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs[3005-3007].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:09:34] (03CR) 10Btullis: [C: 03+1] datahub: Enable OIDC to idp_test (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [18:10:14] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.22 refs T343724 [18:10:18] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [18:10:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
lvs[3005-3007].esams.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:10:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs[3005-3007].esams.wmnet [18:10:38] I'm going to let the train marinate on group0 for an hour. [18:10:43] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs[3005-3007].esams.wmnet` - lvs3005.esams.wmnet (**PASS**) - Do... [18:11:39] (JobUnavailable) firing: (3) Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:12:16] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:15:49] 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10bking) [18:16:39] (JobUnavailable) firing: (3) Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:16:49] (03PS1) 10Eevans: restbase: set legacy ssl port & optional encryption to false [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) [18:18:14] sukhe: np. 
[18:19:40] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:19:42] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:21:33] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [18:26:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:12] (03PS1) 10Bartosz Dziewoński: Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) [18:30:12] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:50] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:42:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:45:40] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:44] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:46:34] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:48:10] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:50:00] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:52:06] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:34] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [18:59:06] (03PS1) 10Bartosz Dziewoński: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) [18:59:08] (03PS1) 10Bartosz Dziewoński: Remove unused RESTBase-related VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) [19:00:37] (03PS2) 10Bartosz Dziewoński: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) [19:01:40] (03PS3) 10Bartosz Dziewoński: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [19:01:58] (03PS2) 10Bartosz Dziewoński: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders) [19:02:17] (03PS4) 10Bartosz Dziewoński: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [19:03:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10andrea.denisse) 05In progress→03Resolved Hi, I sent a patch for this change that was awaiting review. https://gerrit.wikimedia.org/r/c/operations/puppet/+/940269/ Closing... 
[19:04:11] (03CR) 10Bartosz Dziewoński: [C: 04-1] "I want to do this next week" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949593 (https://phabricator.wikimedia.org/T341618) (owner: 10Bartosz Dziewoński) [19:07:18] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Will deploy together with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947015" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) (owner: 10Bartosz Dziewoński) [19:11:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10thcipriani) Approved from the `deployment` group. Rationale makes sense. [19:11:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10thcipriani) [19:11:58] (03Abandoned) 10Andrea Denisse: groups: Add taavi to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/940269 (https://phabricator.wikimedia.org/T342307) (owner: 10Andrea Denisse) [19:14:34] (03PS2) 10Gehel: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 [19:25:08] (03PS3) 10Gehel: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [19:25:35] (03CR) 10Ahmon Dancy: [WIP] Start Balzegraph from systemd unit, without runBlazegraph.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [19:30:51] Rolling the train to group1 [19:31:03] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 (https://phabricator.wikimedia.org/T343724) [19:31:05] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 
(https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [19:31:36] (03CR) 10Eevans: [C: 03+1] Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy) [19:31:47] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949594 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [19:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:40:46] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.22 refs T343724 [19:40:50] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [19:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:48:00] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.22 refs T343724 (duration: 07m 14s) [19:48:04] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [19:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:53:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:58:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T2000) [20:00:05] MatmaRex and aanzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] i can deploy today [20:00:44] hi [20:00:48] hey! [20:01:34] (03CR) 10Urbanecm: [C: 03+2] Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) (owner: 10Bartosz Dziewoński) [20:01:44] i'm fond of experiments, so...let's see :) [20:01:56] (03CR) 10Urbanecm: [C: 03+2] Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders) [20:02:19] (03Merged) 10jenkins-bot: Remove unusual VisualEditor config for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949592 (https://phabricator.wikimedia.org/T241961) (owner: 10Bartosz Dziewoński) [20:02:35] (03Merged) 10jenkins-bot: Disable upcoming wgMFShowEditNotices in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949545 (https://phabricator.wikimedia.org/T312587) (owner: 10Esanders) [20:02:42] (03CR) 10Urbanecm: [C: 03+2] Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [20:02:52] (03PS5) 10Urbanecm: Explicitly set DiscussionToolsAutoTopicSubEditor to 
discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [20:02:57] (03CR) 10Urbanecm: [C: 03+2] Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [20:04:02] (03Merged) 10jenkins-bot: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [20:04:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]] [20:04:44] T312587: Show edit notices within mobile editing interfaces - https://phabricator.wikimedia.org/T312587 [20:04:44] T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 [20:04:46] aanzx: are you around too? 
[20:06:16] !log urbanecm@deploy1002 esanders and urbanecm and matmarex: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwde [20:06:16] bug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:06:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:07:05] MatmaRex: all three pulled to mwdebug, but afaics, they're not testable (wikitech's not XWD-enabled and rest are no-ops). is that right? 
[20:07:29] yeah, i just noticed that the mwdebug stuff doesn't work on wikitech :/ [20:07:34] i guess we're testing this one in production [20:07:38] the rest are indeed no-ops [20:07:43] yeah, i have to sync that out and we'll see [20:07:44] !log urbanecm@deploy1002 esanders and urbanecm and matmarex: Continuing with sync [20:07:46] proceeding [20:11:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:14:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:04] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949592|Remove unusual VisualEditor config for Wikitech (T241961)]], [[gerrit:949545|Disable upcoming wgMFShowEditNotices in production (T312587)]], [[gerrit:943558|Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi]] (duration: 09m 27s) [20:14:09] T312587: Show edit notices within mobile editing interfaces - https://phabricator.wikimedia.org/T312587 [20:14:09] T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 [20:14:20] MatmaRex: deployed to prod. can you test the wikitech stuff please? :) [20:15:16] visual editor seems to work: https://wikitech.wikimedia.org/w/index.php?title=Sandbox&diff=prev&oldid=2100385 [20:15:32] thanks for deploying [20:15:56] great! [20:16:10] aanzx: you around? [20:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:17:21] urbanecm: yes [20:17:47] ok, let's deploy. 
[20:18:00] (03PS5) 10Urbanecm: Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [20:18:02] Ok [20:18:38] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:11] (03CR) 10Urbanecm: [C: 03+2] Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [20:19:51] (03Merged) 10jenkins-bot: Some initial configurations for suwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [20:20:30] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]] [20:20:34] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [20:22:46] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:22:55] aanzx: please test [20:23:01] Testing [20:23:04] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10thcipriani) @Mabualruz I can't remember have you done our https://wikitech.wikimedia.org/wiki/Deployments/Training before? 
I can't seem to find a task... [20:25:19] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:40] urbanecm: tested looks good [20:25:45] thanks, syncing [20:25:46] !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync [20:32:23] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:31] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2002.codfw.wmnet [20:32:32] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:32:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949183|Some initial configurations for suwikisource (T344314)]] (duration: 12m 11s) [20:32:44] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [20:32:55] aanzx: live [20:33:08] urbanecm: ok thanks [20:34:36] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:35:23] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:35:23] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:23] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2002.codfw.wmnet on all recursors [20:35:26] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2002.codfw.wmnet on all recursors [20:35:52] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:36:36] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2002.codfw.wmnet - bking@cumin1001" [20:37:23] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:37:34] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2002.codfw.wmnet with OS bookworm [20:51:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:55:02] (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:56:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230816T2100) [21:40:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:30] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:33] !log T343124 [WDQS] Pooled `wdqs1012` and `wdqs1013` (passing checks after reimage/data transfer) [21:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:37] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [21:52:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2002.codfw.wmnet with OS bookworm [21:52:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2002.codfw.wmnet [22:01:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:44] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:08] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:07] (ProbeDown) firing: (13) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:13:53] hm, that's just esams but I thought it was silenced already [22:14:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:14:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:15:09] (03PS4) 10Bking: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [22:16:39] (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:16:48] oh I see, it looks like s.ukhe's silence covered module=http_ncredir-https_ip[46] but only family=ip4, adding ip6 now [22:18:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:14] done, and matched both values for address too [22:27:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:42] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:00] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:11:08] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:36] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state