[00:03:25] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:42:05] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:13] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:20:43] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:31:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:32:11] <icinga-wm>	 PROBLEM - snapshot of s7 in codfw on backupmon1001 is CRITICAL: snapshot for s7 at codfw (db2098) taken more than 3 days ago: Most recent backup 2022-09-17 01:18:56 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:32:19] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:36:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48683 bytes in 4.537 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0200)
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:07:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191)
[02:07:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[02:08:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:08:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:08:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[02:27:23] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:34:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:34:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:34:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:35:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:38:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:42:35] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[02:43:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:47:53] <icinga-wm>	 PROBLEM - Check systemd state on dbprov1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0300)
[03:01:11] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191)
[03:01:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[03:02:00] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[03:02:28] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.2  refs T314191
[03:02:32] <stashbot>	 T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191
[03:04:25] <icinga-wm>	 PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2022-09-20 02:57:27 is 551 MiB, but the previous one was 231 MiB, a change of +138.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:05:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:06:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:06:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:07:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:23:11] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:33] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:31:51] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:38:36] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.2  refs T314191 (duration: 36m 08s)
[03:38:39] <stashbot>	 T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191
[03:40:40] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.39.0-wmf.28 (duration: 02m 02s)
[03:43:29] <icinga-wm>	 RECOVERY - Check systemd state on dbprov1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:45:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:51:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:56:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:03:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:03:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:09:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:25:11] <icinga-wm>	 RECOVERY - Check systemd state on dbprov2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:07:51] <wikibugs>	 (PS4) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587)
[05:08:00] <wikibugs>	 (CR) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[05:29:39] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:47:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 251, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:49:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 252, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:50:51] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:14] <wikibugs>	 (PS3) KartikMistry: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289)
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0600).
[06:00:09] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:10:33] <wikibugs>	 ops-codfw, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (Marostegui)
[06:20:40] <wikibugs>	 (CR) Dzahn: [C: +2] P:gitlab::runner: $nameservers parameter type should match aliased [puppet] - https://gerrit.wikimedia.org/r/833046 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[06:23:12] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] buildkitd: Bump version to 0.10.4 [docker-images/production-images] - https://gerrit.wikimedia.org/r/830909 (owner: Dduvall)
[06:30:57] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:44:39] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (wiki_willy) Yeah, it looks like we're just past the warranty period.  @Papaul - do you want to try and see if we're still able to submit an RMA?  And if it won't let you, let me know...
[06:57:53] <kart_>	 Updating cxserver before backport deployment window.. I'm the only one in the window with a minor patch so far..
[06:58:01] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2022-09-15-113346-production [deployment-charts] - https://gerrit.wikimedia.org/r/832989 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0700).
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:44] * kart_ is here..
[07:01:38] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2022-09-15-113346-production [deployment-charts] - https://gerrit.wikimedia.org/r/832989 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:02:14] <wikibugs>	 (CR) KartikMistry: [C: +2] testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:02:38] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:02:59] <wikibugs>	 (Merged) jenkins-bot: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:03:10] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:05:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:06:24] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:06:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:07:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:07:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:07:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:08:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:08:19] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:10:50] <kart_>	 !log Updated cxserver to 2022-09-15-113346-production (T317289, T315209)
[07:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:55] <stashbot>	 T315209: Parsoid clients should handle multi-valued rel attributes - https://phabricator.wikimedia.org/T315209
[07:10:55] <stashbot>	 T317289: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289
[07:13:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:14:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:14:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:15:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:18:18] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:832993|testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias (T317289)]] (duration: 03m 46s)
[07:18:21] <stashbot>	 T317289: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289
[07:28:33] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:51:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:29] <icinga-wm>	 PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] <jouncebot>	 jnuche and dancy: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0800). Please do the needful.
[08:07:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:12:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:22:49] <wikibugs>	 (PS1) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315
[08:24:32] <wikibugs>	 (CR) CI reject: [V: -1] admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 (owner: Jbond)
[08:25:59] <wikibugs>	 (PS2) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315
[08:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:33:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:33:20] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[08:33:48] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet
[08:35:20] <hashar>	 !log Restarted CI Jenkins for plugin update
[08:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:39] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - https://gerrit.wikimedia.org/r/833318
[08:41:39] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - https://gerrit.wikimedia.org/r/833318 (owner: Jbond)
[08:43:46] <logmsgbot>	 !log awight@deploy1002 Started deploy [kartotherian/deploy@4759a78]: Merge "Update kartotherian to e3f3854"
[08:45:31] <wikibugs>	 (Merged) jenkins-bot: sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - https://gerrit.wikimedia.org/r/833318 (owner: Jbond)
[08:46:13] <logmsgbot>	 !log awight@deploy1002 Finished deploy [kartotherian/deploy@4759a78]: Merge "Update kartotherian to e3f3854" (duration: 02m 27s)
[08:46:35] <wikibugs>	 (CR) Phuedx: [C: +1] Drop deprecated survey prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: Awight)
[08:51:07] <wikibugs>	 (Abandoned) Milimetric: aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/829233 (owner: Milimetric)
[08:56:45] <icinga-wm>	 RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:59:37] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:04:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:27:13] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:31:05] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:07] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:01:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:53] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "should be good to deploy during the backport window" [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[10:28:50] <wikibugs>	 (PS4) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695)
[10:30:39] <wikibugs>	 (CR) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (2 comments) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[11:01:03] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (jcrespo) @Ladsgroup @Marostegui Thanks for the help debugging. Until issues for db2098 are solved, I have setup db2100 (s8, and newly created s7 instance) to replace db2098 backups fun...
[11:58:31] <icinga-wm>	 RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2100) taken on 2022-09-20 09:48:03 (966 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:09:05] <wikibugs>	 (PS1) Gergő Tisza: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320)
[12:09:51] <wikibugs>	 (PS1) Gergő Tisza: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320)
[12:14:35] <wikibugs>	 (CR) Kosta Harlan: [C: +1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[12:15:38] <wikibugs>	 (CR) Kosta Harlan: [C: +1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[12:35:09] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:43:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:59] <wikibugs>	 (CR) Hashar: [C: +2] "Looks like that is working ;)" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/832518 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[12:56:28] <wikibugs>	 (Merged) jenkins-bot: Use gerrit-deploy for deployment on devtools [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/832518 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[12:56:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (CMyrick-WMF) The username I received via email is 39451. My shell username is cmyrick.  When I am trying to execute kinit, I am having an error tha...
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1300).
[13:00:05] <jouncebot>	 Thiemo_WMDE and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1300)
[13:00:21] <urbanecm>	 I can deploy today
[13:00:36] <urbanecm>	 hi Thiemo_WMDE, tgr_: around?
[13:00:48] <Thiemo_WMDE>	 I'm here.
[13:01:04] <urbanecm>	 ok, let's start with you then :)
[13:01:08] <wikibugs>	 (PS2) Urbanecm: Enable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[13:01:13] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[13:01:15] <Lucas_WMDE>	 \o/
[13:01:19] <wikibugs>	 (PS3) Hashar: Gerrit v3.5.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334)
[13:01:31] <Thiemo_WMDE>	 Only configuration this time ;)
[13:01:40] <urbanecm>	 :)
[13:01:59] <wikibugs>	 (Merged) jenkins-bot: Enable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[13:02:30] <urbanecm>	 Thiemo_WMDE: pulled to mwdebug1001, can you check?
[13:03:22] <Thiemo_WMDE>	 1001. One second.
[13:03:26] <urbanecm>	 sure
[13:03:51] <Thiemo_WMDE>	 Yes, works!
[13:03:56] <urbanecm>	 great, syncing
[13:05:13] <tgr_>	 urbanecm: I'll be late, can deploy myself.
[13:05:30] <urbanecm>	 tgr_: ack, I'll ping you once done.
[13:05:42] <tgr_>	 Or you can just merge them if you are okay with that, they don't need testing.
[13:06:14] <urbanecm>	 ah, yeah, it's just a schema version change
[13:06:15] <urbanecm>	 i'll sync that
[13:06:22] <wikibugs>	 (CR) Urbanecm: [C: +2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[13:06:28] <wikibugs>	 (CR) Urbanecm: [C: +2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[13:06:58] <wikibugs>	 (PS3) Hashar: gerrit: remove unused mysql-connector-java lib [puppet] - https://gerrit.wikimedia.org/r/832344
[13:07:42] <Thiemo_WMDE>	 Can confirm the change is live now, btw. Thanks!
[13:07:50] <urbanecm>	 Thiemo_WMDE: it's not yet, last few seconds
[13:08:25] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0b55db6f80df5f4c89f969332a6b31077a7172c4: Enable Tech Wishes survey on dewiki (T316676) (duration: 04m 12s)
[13:08:29] <stashbot>	 T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676
[13:08:30] <urbanecm>	 Thiemo_WMDE: it's live now
[13:08:37] <Thiemo_WMDE>	 Yea, works.
[13:08:58] <urbanecm>	 great!
[13:09:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:12:45] <wikibugs>	 (PS8) Hashar: gerrit: decouple scap and daemon users [puppet] - https://gerrit.wikimedia.org/r/832345
[13:13:29] <wikibugs>	 (PS4) Hashar: gerrit: change deployment user on devtools [puppet] - https://gerrit.wikimedia.org/r/832507
[13:14:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:16:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:16:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:16:35] <icinga-wm>	 PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:19:55] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:05] <wikibugs>	 SRE, Phabricator: https://phab.wmflabs.org/ Is down - https://phabricator.wikimedia.org/T318158 (Devnull)
[13:25:36] <logmsgbot>	 !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@0e9fb6b]: (no justification provided)
[13:25:47] <logmsgbot>	 !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@0e9fb6b]: (no justification provided) (duration: 00m 11s)
[13:29:17] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:43] <wikibugs>	 (Merged) jenkins-bot: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[13:32:49] <wikibugs>	 (CR) CI reject: [V: -1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[13:33:58] <urbanecm>	 :/
[13:34:24] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: Gergő Tisza)
[13:36:09] <urbanecm>	 tgr_: syncing yours
[13:37:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:38:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:38:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:39:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:39:57] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.40.0-wmf.1/extensions/GrowthExperiments/extension.json: 1a27e05a7ca53a063d5f9e284d6a09546ac8691c: Update HomepageModule schema version (T310320) (duration: 03m 52s)
[13:40:00] <stashbot>	 T310320: Account creation + Growth tools: improve UX for newcomers who create an account while mid-edit - https://phabricator.wikimedia.org/T310320
[13:43:37] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/GrowthExperiments/extension.json: 1ac09d4709c645558f644a885fadc49c05cc04b9: Update HomepageModule schema version (T310320) (duration: 03m 39s)
[13:43:46] <urbanecm>	 tgr_: should be live
[13:44:22] <tgr_>	 thanks urbanecm !
[13:44:25] <urbanecm>	 np
[13:44:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:45:02] <tgr_>	 the schema validation errors haven't stopped but these are frontend events so I imagine it will take a couple minutes.
[13:45:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:45:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:46:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:48:01] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P34884 and previous config saved to /var/cache/conftool/dbconfig/20220920-135006-ladsgroup.json
[13:50:21] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:22] <wikibugs>	 (PS1) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379
[13:57:35] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Resolved→Open @wiki_willy @Jclark-ctr  looks like we are having memory issues again and the host crashed. Could it be the mainboard? ` [Mon Sep 19 13:20:29 2022] EDAC MC1: 1 UE memory read error...
[14:00:29] <logmsgbot>	 !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@1a7c3b9]: (no justification provided)
[14:00:44] <logmsgbot>	 !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@1a7c3b9]: (no justification provided) (duration: 00m 15s)
[14:06:58] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) And the hardware logs: ` ------------------------------------------------------------------------------- Record:      80 Date/Time:   09/19/2022 13:20:27 Source:      system Severity:    Critical Descri...
[14:08:39] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/833379 (owner: Hashar)
[14:10:16] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[14:15:48] <wikibugs>	 (PS1) Marostegui: Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039
[14:16:29] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet
[14:17:00] <wikibugs>	 (PS2) Marostegui: Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039
[14:17:51] <icinga-wm>	 RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:18:23] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:05] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039 (owner: Marostegui)
[14:29:55] <wikibugs>	 (PS1) Hashar: gerrit: make daemon_user variable everywhere [puppet] - https://gerrit.wikimedia.org/r/833385
[14:30:09] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:52] <wikibugs>	 (PS1) Ssingh: certspotter: remove deprecated Google CT logs [puppet] - https://gerrit.wikimedia.org/r/833387
[14:32:14] <sukhe>	 ^ the certspotter failures on the alerting host should be resolved by this
[14:33:09] <wikibugs>	 (PS2) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379
[14:33:48] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar)
[14:34:42] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37299/console" [puppet] - https://gerrit.wikimedia.org/r/833387 (owner: Ssingh)
[14:37:43] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:40:27] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] certspotter: remove deprecated Google CT logs [puppet] - https://gerrit.wikimedia.org/r/833387 (owner: Ssingh)
[14:51:17] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:17] <sukhe>	 ^ should be resolved by 04ee08339
[14:53:37] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:00] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (Ottomata) @awight I think you will need a WMF manager to approve/sponsor this, as well as an expiry date MOA for this account.    Also, please review https://wikitech.wikimedia...
[15:03:08] <wikibugs>	 SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata)
[15:04:46] <wikibugs>	 SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) The rest of the access for this account was given in {T314117}.  This request is adding Kerberos only.  I approve.    Doing now...
[15:06:49] <wikibugs>	 (PS1) Ottomata: Set krb: present for alinebruenger [puppet] - https://gerrit.wikimedia.org/r/833399 (https://phabricator.wikimedia.org/T314117)
[15:07:16] <wikibugs>	 SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) @Aline_Bruenger_WMDE you should have an email with instructions to login and set your kerberos password.   Please confirm when you've done so. :)
[15:10:42] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (Ottomata) Interesting!  @BCornwall may have used your uid rather than your username when creating the kerberos principal.  @CMyrick-WMF I have just...
[15:11:27] <wikibugs>	 (CR) Ottomata: [C: +2] Set krb: present for alinebruenger [puppet] - https://gerrit.wikimedia.org/r/833399 (https://phabricator.wikimedia.org/T314117) (owner: Ottomata)
[15:17:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:17:43] <wikibugs>	 SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Siko_WMDE) Hi @Ottomata ,  I also would need Kerberos (siko, username: Siko WMDE), the rest of the access for my account was also already given (see https:/...
[15:24:09] <wikibugs>	 (CR) Ottomata: [C: +1] admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 (owner: Jbond)
[15:27:13] <wikibugs>	 SRE, serviceops, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (LSobanski) Open→Stalled
[15:27:16] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (LSobanski)
[15:33:24] <wikibugs>	 (PS3) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315
[15:35:07] <wikibugs>	 (PS1) Ottomata: Set krb: present for siko [puppet] - https://gerrit.wikimedia.org/r/833404 (https://phabricator.wikimedia.org/T316766)
[15:35:33] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Operations, Patch-For-Review: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) Ah, sorry missed that.  I can do it here.  Link to {T315878}.  You should have an email now.
[15:38:06] <wikibugs>	 (CR) Ottomata: [C: +2] Set krb: present for siko [puppet] - https://gerrit.wikimedia.org/r/833404 (https://phabricator.wikimedia.org/T316766) (owner: Ottomata)
[15:39:02] <dancy>	 jouncebot nowandnext
[15:39:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[15:39:02] <jouncebot>	 In 0 hour(s) and 20 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600)
[15:40:58] <wikibugs>	 (PS1) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[15:45:17] <awight>	 brett: Hi, I'm planning to do a quick, "urgent" mw-config patch to disable a QuickSurvey.
[15:45:59] <brett>	 awight: I think you may have tagged the wrong person ^^;
[15:46:41] <awight>	 oops, sorry!  thcipriani ^ deployment request
[15:47:12] <thcipriani>	 jouncebot: now
[15:47:12] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 12 minute(s)
[15:47:23] <thcipriani>	 jouncebot: next
[15:47:23] <jouncebot>	 In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600)
[15:47:29] <thcipriani>	 awight: go for it
[15:48:13] <wikibugs>	 (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[15:48:27] <awight>	 ty!
[15:48:50] <wikibugs>	 SRE, serviceops, serviceops-collab, Patch-For-Review, Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (LSobanski) Needs consulting with Alex / Giuseppe before proceeding.
[15:49:14] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[15:50:24] <wikibugs>	 (Merged) jenkins-bot: Disable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[15:51:47] <wikibugs>	 SRE, SRE-OnFire, Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (LSobanski)
[15:52:22] <wikibugs>	 (PS1) Aqu: Deploy Spark 3 to production [puppet] - https://gerrit.wikimedia.org/r/833412 (https://phabricator.wikimedia.org/T312882)
[15:55:10] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:833411|Disable Tech Wishes survey on dewiki (T316676)]] (duration: 03m 53s)
[15:55:14] <stashbot>	 T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676
[15:55:35] <awight>	 thcipriani: Done. Thanks again! o/
[15:55:48] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Disable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[15:57:53] <wikibugs>	 (CR) Ahmon Dancy: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[16:00:04] <jouncebot>	 jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600)
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:14] <awight>	 thcipriani: I forgot how to deploy.  One more try.
[16:00:38] <dancy>	 awight: `scap deploy 833411`
[16:00:45] <dancy>	 or, not deploy.. backport
[16:00:47] <dancy>	 `scap backport 833411`
[16:00:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (BCornwall) Indeed, I did use the uid. Thanks for handling that, @Ottomata.
[16:01:31] <awight>	 dancy: Amazing, I didn't know about this yet!  I was using this tool: https://deploy-commands.toolforge.org/bacc/833411
[16:01:52] <awight>	 But I failed to run "git rebase" so just spent 10 minutes deploying the old version of a file /o\
[16:01:57] <dancy>	 I'm hoping deploy-commands.toolforge will be updated to recommend scap backport soon.
[16:02:13] <dancy>	 doh!
[16:04:39] <awight>	 Reading through the source for `scap backport` now, very nice work!
[16:04:41] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:833411|Disable Tech Wishes survey on dewiki (T316676)]] (take 2) (duration: 03m 42s)
[16:04:45] <stashbot>	 T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676
[16:04:49] <awight>	 okay, actually done.
[16:06:10] * Lucas_WMDE will have to look at scap backport
[16:06:34] <dancy>	 We intend to do a writeup and advertise it more soon.  
[16:06:56] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: Remove unused release_repo_update_mediawiki_releases_values_cmd [puppet] - https://gerrit.wikimedia.org/r/833414
[16:07:03] <Lucas_WMDE>	 nice
[16:09:00] <logmsgbot>	 !log awight@deploy1002 backport aborted:  (duration: 00m 33s)
[16:09:15] <dancy>	 jouncebot now
[16:09:15] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600)
[16:09:15] <awight>	 haha it's a bit spammy
[16:09:22] <dancy>	 nod.
[16:09:37] <dancy>	 rzl/jbond: Can you deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/833414 please?
[16:10:15] <awight>	 Looks like I need to install some gerrit credentials to use this.
[16:10:41] <jbond>	 dancy: we are all at an SRE summit currently, I should be free in 20-40 mins, can it wait?
[16:10:56] <dancy>	 hmmm.. can you show me a transcript of what you're dealing with? It should all be automatic if you're in the `deployment` group.
[16:11:06] <dancy>	 jbond:  Can do! thanks!
[16:11:15] <jbond>	 dancy: cool, I'll ping when free
[16:12:53] <wikibugs>	 (PS1) Dduvall: gitlab_runner: Bump buildkitd version to 0.10.4-2 [puppet] - https://gerrit.wikimedia.org/r/833416 (https://phabricator.wikimedia.org/T318019)
[16:15:54] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:23:13] <wikibugs>	 (CR) Jbond: [C: +2] scap.cfg.erb: Remove unused release_repo_update_mediawiki_releases_values_cmd [puppet] - https://gerrit.wikimedia.org/r/833414 (owner: Ahmon Dancy)
[16:24:05] <dancy>	 jbond: thanks!
[16:28:03] <jbond>	 dancy: ^^ merged, let me know if you want me to deploy it somewhere so you can test?
[16:28:27] <dancy>	 Sure, deploy1002
[16:31:05] <jbond>	 dancy: done https://phabricator.wikimedia.org/P34886
[16:31:41] <wikibugs>	 (PS4) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315
[16:33:29] <wikibugs>	 (CR) Jbond: [C: +2] admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 (owner: Jbond)
[16:34:50] <wikibugs>	 (PS1) Ebernhardson: sre.wdqs.data-reload: puppet should stay disabled during data reload [cookbooks] - https://gerrit.wikimedia.org/r/833422
[16:34:52] <wikibugs>	 (PS1) Ebernhardson: sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423
[16:41:47] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing, disregard
[16:42:19] <logmsgbot>	 !log dancy@deploy1002 dancy: testing, disregard synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[16:42:25] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[16:42:27] <dancy>	 jbond: All is well. Thanks again.
[16:44:27] <jbond>	 dancy: np
[16:50:21] <wikibugs>	 (Abandoned) Ebernhardson: sre.wdqs.data-reload: puppet should stay disabled during data reload [cookbooks] - https://gerrit.wikimedia.org/r/833422 (owner: Ebernhardson)
[16:50:28] <wikibugs>	 (PS2) Ebernhardson: sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423
[17:06:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:09:08] <icinga-wm>	 RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:09:30] <wikibugs>	 (CR) Phuedx: [C: +1] [WIP] Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: Awight)
[17:10:30] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:11] <wikibugs>	 (PS1) Vlad.shapik: Add configurations to get an HTML test coverage report [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[17:15:16] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:38:49] <wikibugs>	 (PS1) TChin: Bump eventgate-* image versions [deployment-charts] - https://gerrit.wikimedia.org/r/833428 (https://phabricator.wikimedia.org/T313202)
[17:41:45] <wikibugs>	 SRE, Analytics-Radar, Domains, Traffic-Icebox, WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (BCornwall) I think this ticket could do with some more clarification: Whi...
[17:45:20] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:46] <wikibugs>	 SRE, Discovery-Search, serviceops, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Gehel)
[17:55:47] <dancy>	 jouncebot now
[17:55:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[17:55:52] <dancy>	 jouncebot nowandnext
[17:55:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[17:55:52] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1800)
[17:56:30] <dancy>	 I'll roll the train forward about 20-30 minutes into the upcoming train window.  Need to go afk for a bit.
[17:57:32] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[17:59:54] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:10:55] <wikibugs>	 (PS1) Jdlrobson: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493)
[18:11:05] <wikibugs>	 (CR) CI reject: [V: -1] Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[18:16:46] <wikibugs>	 (CR) Ottomata: [C: +2] Bump eventgate-* image versions [deployment-charts] - https://gerrit.wikimedia.org/r/833428 (https://phabricator.wikimedia.org/T313202) (owner: TChin)
[18:18:19] <wikibugs>	 (PS2) Jdlrobson: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493)
[18:19:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[18:20:26] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[18:21:32] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:22:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:22:49] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[18:23:38] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[18:26:25] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[18:27:03] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[18:27:15] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[18:28:17] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[18:28:28] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[18:29:21] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[18:30:47] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[18:31:12] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:31:20] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:31:33] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 (https://phabricator.wikimedia.org/T314191)
[18:31:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[18:31:59] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[18:32:17] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[18:32:55] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:33:02] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[18:33:47] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:36:48] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.2  refs T314191
[18:36:52] <stashbot>	 T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191
[18:37:45] <wikibugs>	 (PS2) Jcrespo: dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062)
[18:37:48] <wikibugs>	 (PS1) Jcrespo: dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - https://gerrit.wikimedia.org/r/833439
[18:39:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:40:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:40:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:40:37] <wikibugs>	 (PS2) Jcrespo: dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - https://gerrit.wikimedia.org/r/833439
[18:42:12] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - https://gerrit.wikimedia.org/r/833439 (owner: Jcrespo)
[18:42:40] <wikibugs>	 (CR) Gehel: [C: +2] sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423 (owner: Ebernhardson)
[18:44:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:45:43] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[18:46:07] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[18:46:26] <wikibugs>	 (PS1) Jcrespo: dbbackups: Fix memory consumption of db2100 (s7) - followup [puppet] - https://gerrit.wikimedia.org/r/833440
[18:46:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[18:46:44] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:47:17] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[18:47:26] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[18:47:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[18:47:34] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:48:06] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Fix memory consumption of db2100 (s7) - followup [puppet] - https://gerrit.wikimedia.org/r/833440 (owner: Jcrespo)
[18:48:25] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[18:50:39] <jynus>	 !log restart db2100:s7 to apply new config
[18:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:15] <wikibugs>	 (PS1) Bking: wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441
[18:53:20] <wikibugs>	 (PS3) Jcrespo: dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062)
[18:53:52] <wikibugs>	 (CR) Gehel: [C: +2] wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441 (owner: Bking)
[18:54:38] <wikibugs>	 (PS1) DLynch: Register the editattempt_block schema [mediawiki-config] - https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390)
[18:55:04] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo)
[18:58:19] <wikibugs>	 (Merged) jenkins-bot: wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441 (owner: Bking)
[19:05:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[19:07:27] <wikibugs>	 (PS5) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695)
[19:07:35] <wikibugs>	 (PS6) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695)
[19:11:49] <wikibugs>	 (CR) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[19:27:05] <wikibugs>	 (CR) Ottomata: [C: +2] charts:eventstreams bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena)
[19:31:07] <wikibugs>	 (Merged) jenkins-bot: charts:eventstreams bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena)
[19:53:00] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "Looks reasonable to me.  Analogous to https://gerrit.wikimedia.org/r/c/operations/puppet/+/810146" [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[19:53:41] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37300/" [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[19:54:37] <wikibugs>	 (PS1) Gmodena: Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390)
[19:55:11] <wikibugs>	 (CR) Ottomata: [C: +2] Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena)
[19:56:30] <wikibugs>	 (PS3) Samtar: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[19:56:40] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@62d8262]: Regular analytics weekly train [analytics/refinery@62d8262]
[19:58:59] <wikibugs>	 (Merged) jenkins-bot: Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T2000). Please do the needful.
[20:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <TheresNoTime>	 evening
[20:00:57] * TheresNoTime can deploy!
[20:02:00] <cjming>	 hi TheresNoTime -- I can deploy Jon's patch -- and since it's the only one in the queue, it will be quick
[20:02:07] <icinga-wm>	 RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:13] <TheresNoTime>	 cjming: go for it :D
[20:02:30] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[20:02:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[20:02:34] <cjming>	 will do - thanks for being willing!
[20:03:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[20:04:10] <wikibugs>	 (Merged) jenkins-bot: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[20:04:38] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]]
[20:04:40] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@62d8262]: Regular analytics weekly train [analytics/refinery@62d8262] (duration: 08m 00s)
[20:04:42] <stashbot>	 T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493
[20:05:03] <logmsgbot>	 !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:05:23] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@62d8262] (thin): Regular analytics weekly train THIN [analytics/refinery@62d8262]
[20:05:30] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@62d8262] (thin): Regular analytics weekly train THIN [analytics/refinery@62d8262] (duration: 00m 07s)
[20:09:09] <icinga-wm>	 PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:11:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:13:41] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]] (duration: 09m 02s)
[20:13:44] <stashbot>	 T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493
[20:15:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:15:39] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[20:19:21] <cjming>	 shutting it down early today
[20:19:24] <cjming>	 !log end of UTC late backport window
[20:19:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:30] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) @akosiaris, good news, Gabriele is working on this!!!  @Jelto @JMeybohm, it seems the upgr...
[20:27:08] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) BTW, since we merged the helm chart changes, eventstreams is currently undeployable.  We r...
[20:30:51] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab: reduce backup_keep_time to 1d [puppet] - https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[20:59:23] <wikibugs>	 (Abandoned) Brennen Bearnes: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[21:00:22] <wikibugs>	 (Abandoned) Brennen Bearnes: scap: separate new rev perms from old rev perm cleanup [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[21:03:08] <wikibugs>	 (PS1) Sbailey: Enable Linter write of namespace tag and template fields on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177)
[21:05:36] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242)
[21:06:47] <wikibugs>	 (PS2) Ahmon Dancy: scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242)
[21:15:49] <wikibugs>	 SRE, Data-Engineering, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (Milimetric) This is a nasty bug if Andrew happens to not be around, I just wanna ++ the tech debt val...
[21:28:55] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall)
[21:29:02] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall) a:BCornwall
[21:38:13] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 855221 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[22:04:38] <wikibugs>	 (PS1) Zabe: Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126)
[22:28:07] <icinga-wm>	 PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:30:15] <wikibugs>	 (PS1) Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711)
[22:30:48] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[22:51:37] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Reopened ticket With dell