[00:03:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:42:05] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:13] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:20:43] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:31:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:32:11] PROBLEM - snapshot of s7 in codfw on backupmon1001 is CRITICAL: snapshot for s7 at codfw (db2098) taken more than 3 days ago: Most recent backup 2022-09-17 01:18:56 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:32:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:36:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48683 bytes in 4.537 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:37:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: 
HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0200) [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191) [02:07:33] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191) (owner: 10TrainBranchBot) [02:08:07] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:16] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.2 [core] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833136 (https://phabricator.wikimedia.org/T314191) (owner: 10TrainBranchBot) [02:27:23] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:34:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:34:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:34:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:35:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:42:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [02:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:47:53] PROBLEM - Check systemd state on dbprov1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0300) [03:01:11] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191) [03:01:13] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot) [03:02:00] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833137 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot) [03:02:28] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.2 refs T314191 [03:02:32] T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191 [03:04:25] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2022-09-20 02:57:27 is 551 MiB, but the previous one was 231 MiB, a change of +138.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:05:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:06:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:07:53] !log
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:23:11] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:33] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:31:51] PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:38:36] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.2 refs T314191 (duration: 36m 08s) [03:38:39] T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191 [03:40:40] !log mwpresync@deploy1002 Pruned MediaWiki: 1.39.0-wmf.28 (duration: 02m 02s) [03:43:29] RECOVERY - Check systemd state on dbprov1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:45:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:51:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:56:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:03:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:03:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:09:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:25:11] RECOVERY - Check systemd state on dbprov2003 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:51] (03PS4) 10Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) [05:08:00] (03CR) 10Abijeet Patro: Add editcontentmodel right for metawiki translation administrators (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [05:29:39] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:47:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 251, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:49:41] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 252, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:50:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:14] (03PS3) 10KartikMistry: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289) [06:00:04] kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0600). 
[06:00:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:33] ops-codfw, DBA, Data-Persistence, Data-Persistence-Backup, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (Marostegui) [06:20:40] (CR) Dzahn: [C: +2] P:gitlab::runner: $nameservers parameter type should match aliased [puppet] - https://gerrit.wikimedia.org/r/833046 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall) [06:23:12] (CR) Dzahn: [V: +2 C: +2] buildkitd: Bump version to 0.10.4 [docker-images/production-images] - https://gerrit.wikimedia.org/r/830909 (owner: Dduvall) [06:30:57] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:44:39] SRE, ops-codfw, DBA, Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (wiki_willy) Yeah, it looks like we're just past the warranty period. @Papaul - do you want to try and see if we're still able to submit an RMA? And if it won't let you, let me know... [06:57:53] Updating cxserver before backport deployment window.. I'm the only one in the window with a minor patch so far.. [06:58:01] (CR) KartikMistry: [C: +2] Update cxserver to 2022-09-15-113346-production [deployment-charts] - https://gerrit.wikimedia.org/r/832989 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry) [07:00:04] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:44] * kart_ is here..
[07:01:38] (03Merged) 10jenkins-bot: Update cxserver to 2022-09-15-113346-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/832989 (https://phabricator.wikimedia.org/T317289) (owner: 10KartikMistry) [07:02:14] (03CR) 10KartikMistry: [C: 03+2] testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289) (owner: 10KartikMistry) [07:02:38] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [07:02:59] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289) (owner: 10KartikMistry) [07:03:10] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [07:05:30] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [07:06:24] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [07:06:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:07:22] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [07:07:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:07:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:08:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:08:19] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [07:10:50] !log Updated cxserver to 2022-09-15-113346-production (T317289, T315209) [07:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:55] T315209: Parsoid clients should handle multi-valued rel attributes - https://phabricator.wikimedia.org/T315209 [07:10:55] T317289: Enable Content and Section 
translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289 [07:13:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:14:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:14:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:15:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:18:18] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:832993|testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias (T317289)]] (duration: 03m 46s) [07:18:21] T317289: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289 [07:28:33] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:51:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:55:29] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:05] jnuche and dancy: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T0800). Please do the needful. 
[08:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:22:49] (03PS1) 10Jbond: admin: add my self to the analytics-private-data group [puppet] - 10https://gerrit.wikimedia.org/r/833315 [08:24:32] (03CR) 10CI reject: [V: 04-1] admin: add my self to the analytics-private-data group [puppet] - 10https://gerrit.wikimedia.org/r/833315 (owner: 10Jbond) [08:25:59] (03PS2) 10Jbond: admin: add my self to the analytics-private-data group [puppet] - 10https://gerrit.wikimedia.org/r/833315 [08:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:13] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:20] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [08:33:48] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [08:35:20] !log Restarted CI Jenkins for plugin update [08:35:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:39] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - 10https://gerrit.wikimedia.org/r/833318 [08:41:39] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - 10https://gerrit.wikimedia.org/r/833318 (owner: 10Jbond) [08:43:46] !log awight@deploy1002 Started deploy [kartotherian/deploy@4759a78]: Merge "Update kartotherian to e3f3854" [08:45:31] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: cast firmware_store_dir to Path [cookbooks] - 10https://gerrit.wikimedia.org/r/833318 (owner: 10Jbond) [08:46:13] !log awight@deploy1002 Finished deploy [kartotherian/deploy@4759a78]: Merge "Update kartotherian to e3f3854" (duration: 02m 27s) [08:46:35] (03CR) 10Phuedx: [C: 03+1] Drop deprecated survey prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: 10Awight) [08:51:07] (03Abandoned) 10Milimetric: aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/829233 (owner: 10Milimetric) [08:56:45] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:59:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:04:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:27:13] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [09:31:05] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:51:53] PROBLEM - Check systemd state on 
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:07] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet [10:01:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "should be good to deploy during the backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: 10Awight) [10:28:50] (03PS4) 10Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - 10https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) [10:30:39] (03CR) 10Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: 10Samtar) [11:01:03] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 3 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10jcrespo) @Ladsgroup @Marostegui Thanks for the help debugging. Until issues for db2098 are solved, I have setup db2100 (s8, and newly created s7 instance) to replace db2098 backups fun... 
[11:58:31] RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2100) taken on 2022-09-20 09:48:03 (966 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:09:05] (03PS1) 10Gergő Tisza: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) [12:09:51] (03PS1) 10Gergő Tisza: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) [12:14:35] (03CR) 10Kosta Harlan: [C: 03+1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [12:15:38] (03CR) 10Kosta Harlan: [C: 03+1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [12:35:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:43:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:59] 
(03CR) 10Hashar: [C: 03+2] "Looks like that is working ;)" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/832518 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [12:56:28] (03Merged) 10jenkins-bot: Use gerrit-deploy for deployment on devtools [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/832518 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [12:56:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10CMyrick-WMF) The username I received via email is 39451. My shell username is cmyrick. When I am trying to execute kinit, I am having an error tha... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1300). [13:00:05] Thiemo_WMDE and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1300) [13:00:21] I can deploy today [13:00:36] hi Thiemo_WMDE, tgr_: around? [13:00:48] I'm here. 
[13:01:04] ok, let's start with you then :) [13:01:08] (03PS2) 10Urbanecm: Enable Tech Wishes survey on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: 10Awight) [13:01:13] (03CR) 10Urbanecm: [C: 03+2] Enable Tech Wishes survey on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: 10Awight) [13:01:15] \o/ [13:01:19] (03PS3) 10Hashar: Gerrit v3.5.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) [13:01:31] Only configuration this time ;) [13:01:40] :) [13:01:59] (03Merged) 10jenkins-bot: Enable Tech Wishes survey on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: 10Awight) [13:02:30] Thiemo_WMDE: pulled to mwdebug1001, can you check? [13:03:22] 1001. One second. [13:03:26] sure [13:03:51] Yes, works! [13:03:56] great, syncing [13:05:13] urbanecm: I'll be late, can deploy myself. [13:05:30] tgr_: ack, I'll ping you once done. [13:05:42] Or you can just merge them if you are okay with that, they don't need testing. [13:06:14] ah, yeah, it's just a schema version change [13:06:15] i'll sync that [13:06:22] (03CR) 10Urbanecm: [C: 03+2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [13:06:28] (03CR) 10Urbanecm: [C: 03+2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [13:06:58] (03PS3) 10Hashar: gerrit: remove unused mysql-connector-java lib [puppet] - 10https://gerrit.wikimedia.org/r/832344 [13:07:42] Can confirm the change is live now, btw. Thanks! 
[13:07:50] Thiemo_WMDE: it's not yet, last few seconds [13:08:25] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0b55db6f80df5f4c89f969332a6b31077a7172c4: Enable Tech Wishes survey on dewiki (T316676) (duration: 04m 12s) [13:08:29] T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676 [13:08:30] Thiemo_WMDE: it's live now [13:08:37] Yea, works. [13:08:58] great! [13:09:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:59] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:45] (03PS8) 10Hashar: gerrit: decouple scap and daemon users [puppet] - 10https://gerrit.wikimedia.org/r/832345 [13:13:29] (03PS4) 10Hashar: gerrit: change deployment user on devtools [puppet] - 10https://gerrit.wikimedia.org/r/832507 [13:14:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:16:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:35] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:19:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - 
degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:05] 10SRE, 10Phabricator: https://phab.wmflabs.org/ Is down - https://phabricator.wikimedia.org/T318158 (10Devnull) [13:25:36] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@0e9fb6b]: (no justification provided) [13:25:47] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@0e9fb6b]: (no justification provided) (duration: 00m 11s) [13:29:17] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:43] (03Merged) 10jenkins-bot: Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833037 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [13:32:49] (03CR) 10CI reject: [V: 04-1] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [13:33:58] :/ [13:34:24] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Update HomepageModule schema version [extensions/GrowthExperiments] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833038 (https://phabricator.wikimedia.org/T310320) (owner: 10Gergő Tisza) [13:36:09] tgr_: syncing yours [13:37:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:38:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:38:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:39:57] !log urbanecm@deploy1002 Synchronized php-1.40.0-wmf.1/extensions/GrowthExperiments/extension.json: 1a27e05a7ca53a063d5f9e284d6a09546ac8691c: Update HomepageModule 
schema version (T310320) (duration: 03m 52s) [13:40:00] T310320: Account creation + Growth tools: improve UX for newcomers who create an account while mid-edit - https://phabricator.wikimedia.org/T310320 [13:43:37] !log urbanecm@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/GrowthExperiments/extension.json: 1ac09d4709c645558f644a885fadc49c05cc04b9: Update HomepageModule schema version (T310320) (duration: 03m 39s) [13:43:46] tgr_: should be live [13:44:22] thanks urbanecm ! [13:44:25] np [13:44:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:45:02] the schema validation errors haven't stopped but these are frontend events so I imagine it will take a couple minutes. [13:45:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:45:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:46:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:48:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P34884 and previous config saved to /var/cache/conftool/dbconfig/20220920-135006-ladsgroup.json [13:50:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:22] (PS1) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379 [13:57:35] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Resolved→Open @wiki_willy @Jclark-ctr looks like we are having memory issues again and the host crashed. Could it be the mainboard? 
` [Mon Sep 19 13:20:29 2022] EDAC MC1: 1 UE memory read error... [14:00:29] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@1a7c3b9]: (no justification provided) [14:00:44] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@1a7c3b9]: (no justification provided) (duration: 00m 15s) [14:06:58] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) And the hardware logs: ` ------------------------------------------------------------------------------- Record: 80 Date/Time: 09/19/2022 13:20:27 Source: system Severity: Critical Descri... [14:08:39] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/833379 (owner: Hashar) [14:10:16] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [14:15:48] (PS1) Marostegui: Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039 [14:16:29] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [14:17:00] (PS2) Marostegui: Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039 [14:17:51] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:18:23] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:05] (CR) Marostegui: [C: +2] Revert "Revert "db1189: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/833039 (owner: Marostegui) [14:29:55] (PS1) Hashar: gerrit: make daemon_user variable everywhere [puppet] - https://gerrit.wikimedia.org/r/833385 [14:30:09] RECOVERY - Check systemd state on 
alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:52] (PS1) Ssingh: certspotter: remove deprecated Google CT logs [puppet] - https://gerrit.wikimedia.org/r/833387 [14:32:14] ^ the certspotter failures on the alerting host should be resolved by this [14:33:09] (PS2) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379 [14:33:48] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar) [14:34:42] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37299/console" [puppet] - https://gerrit.wikimedia.org/r/833387 (owner: Ssingh) [14:37:43] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:40:27] (CR) Ssingh: [V: +1 C: +2] certspotter: remove deprecated Google CT logs [puppet] - https://gerrit.wikimedia.org/r/833387 (owner: Ssingh) [14:51:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:17] ^ should be resolved by 04ee08339 [14:53:37] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:00] SRE, SRE-Access-Requests: Requesting access to Eventlogs, Stats for Simulo-wikitech - https://phabricator.wikimedia.org/T318058 (Ottomata) @awight I think you will need a WMF manager to approve/sponsor this, as well as an expiry date MOA for this account. Also, please review https://wikitech.wikimedia... 
[15:03:08] SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) [15:04:46] SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) The rest of the access for this account was given in {T314117}. This request is adding Kerberos only. I approve. Doing now... [15:06:49] (PS1) Ottomata: Set krb: present for alinebruenger [puppet] - https://gerrit.wikimedia.org/r/833399 (https://phabricator.wikimedia.org/T314117) [15:07:16] SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) @Aline_Bruenger_WMDE you should have an email with instructions to login and set your kerberos password. Please confirm when you've done so. :) [15:10:42] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (Ottomata) Interesting! @BCornwall may have used your uid rather than your username when creating the kerberos principal. @CMyrick-WMF I have just... [15:11:27] (CR) Ottomata: [C: +2] Set krb: present for alinebruenger [puppet] - https://gerrit.wikimedia.org/r/833399 (https://phabricator.wikimedia.org/T314117) (owner: Ottomata) [15:17:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:43] SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Siko_WMDE) Hi @Ottomata , I also would need Kerberos (siko, username: Siko WMDE), the rest of the access for my account was also already given (see https:/... 
[15:24:09] (CR) Ottomata: [C: +1] admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 (owner: Jbond) [15:27:13] SRE, serviceops, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (LSobanski) Open→Stalled [15:27:16] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (LSobanski) [15:33:24] (PS3) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 [15:35:07] (PS1) Ottomata: Set krb: present for siko [puppet] - https://gerrit.wikimedia.org/r/833404 (https://phabricator.wikimedia.org/T316766) [15:35:33] SRE, SRE-Access-Requests, Data-Engineering-Operations, Patch-For-Review: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Ottomata) Ah, sorry missed that. I can do it here. Link to {T315878}. You should have an email now. [15:38:06] (CR) Ottomata: [C: +2] Set krb: present for siko [puppet] - https://gerrit.wikimedia.org/r/833404 (https://phabricator.wikimedia.org/T316766) (owner: Ottomata) [15:39:02] jouncebot nowandnext [15:39:02] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [15:39:02] In 0 hour(s) and 20 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600) [15:40:58] (PS1) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) [15:45:17] brett: Hi, I'm planning to do a quick, "urgent" mw-config patch to disable a QuickSurvey. [15:45:59] awight: I think you may have tagged the wrong person ^^; [15:46:41] oops, sorry! 
thcipriani ^ deployment request [15:47:12] jouncebot: now [15:47:12] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [15:47:23] jouncebot: next [15:47:23] In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600) [15:47:29] awight: go for it [15:48:13] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight) [15:48:27] ty! [15:48:50] SRE, serviceops, serviceops-collab, Patch-For-Review, Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (LSobanski) Needs consulting with Alex / Giuseppe before proceeding. [15:49:14] (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight) [15:50:24] (Merged) jenkins-bot: Disable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight) [15:51:47] SRE, SRE-OnFire, Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (LSobanski) [15:52:22] (PS1) Aqu: Deploy Spark 3 to production [puppet] - https://gerrit.wikimedia.org/r/833412 (https://phabricator.wikimedia.org/T312882) [15:55:10] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:833411|Disable Tech Wishes survey on dewiki (T316676)]] (duration: 03m 53s) [15:55:14] T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676 [15:55:35] thcipriani: Done. Thanks again! 
o/ [15:55:48] (CR) Thiemo Kreuz (WMDE): [C: +1] Disable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833411 (https://phabricator.wikimedia.org/T316676) (owner: Awight) [15:57:53] (CR) Ahmon Dancy: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar) [16:00:04] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600) [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:14] thcipriani: I forgot how to deploy. One more try. [16:00:38] awight: `scap deploy 833411` [16:00:45] or, not deploy.. backport [16:00:47] `scap backport 833411` [16:00:59] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (BCornwall) Indeed, I did use the uid. Thanks for handling that, @Ottomata. [16:01:31] dancy: Amazing, I didn't know about this yet! I was using this tool: https://deploy-commands.toolforge.org/bacc/833411 [16:01:52] But I failed to run "git rebase" so just spent 10 minutes deploying the old version of a file /o\ [16:01:57] I'm hoping deploy-commands.toolforge will be updated to recommend scap backport soon. [16:02:13] doh! [16:04:39] Reading through the source for `scap backport` now, very nice work! [16:04:41] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:833411|Disable Tech Wishes survey on dewiki (T316676)]] (take 2) (duration: 03m 42s) [16:04:45] T316676: Set up QuickSurvey for TechnicalWishes participatory development evaluation - https://phabricator.wikimedia.org/T316676 [16:04:49] okay, actually done. 
[16:06:10] * Lucas_WMDE will have to look at scap backport [16:06:34] We intend to do a writeup and advertise it more soon. [16:06:56] (PS1) Ahmon Dancy: scap.cfg.erb: Remove unused release_repo_update_mediawiki_releases_values_cmd [puppet] - https://gerrit.wikimedia.org/r/833414 [16:07:03] nice [16:09:00] !log awight@deploy1002 backport aborted: (duration: 00m 33s) [16:09:15] jouncebot now [16:09:15] For the next 0 hour(s) and 50 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1600) [16:09:15] haha it's a bit spammy [16:09:22] nod. [16:09:37] rzl/jbond: Can you deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/833414 please? [16:10:15] Looks like I need to install some gerrit credentials to use this. [16:10:41] dancy: we are all at an SRE summit currently, I should be free in 20-40 mins, can it wait? [16:10:56] hmmm.. can you show me a transcript of what you're dealing with? It should all be automatic if you're in the `deployment` group. [16:11:06] jbond: Can do! thanks! [16:11:15] dancy: cool, I'll ping when free [16:12:53] (PS1) Dduvall: gitlab_runner: Bump buildkitd version to 0.10.4-2 [puppet] - https://gerrit.wikimedia.org/r/833416 (https://phabricator.wikimedia.org/T318019) [16:15:54] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:30] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:23:13] (CR) Jbond: [C: +2] scap.cfg.erb: Remove unused release_repo_update_mediawiki_releases_values_cmd [puppet] - https://gerrit.wikimedia.org/r/833414 (owner: Ahmon Dancy) [16:24:05] jbond: thanks! 
[16:28:03] dancy: ^^ merged, let me know if you want me to deploy it somewhere so you can test? [16:28:27] Sure, deploy1002 [16:31:05] dancy: done https://phabricator.wikimedia.org/P34886 [16:31:41] (PS4) Jbond: admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 [16:33:29] (CR) Jbond: [C: +2] admin: add my self to the analytics-private-data group [puppet] - https://gerrit.wikimedia.org/r/833315 (owner: Jbond) [16:34:50] (PS1) Ebernhardson: sre.wdqs.data-reload: puppet should stay disabled during data reload [cookbooks] - https://gerrit.wikimedia.org/r/833422 [16:34:52] (PS1) Ebernhardson: sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423 [16:41:47] !log dancy@deploy1002 Started scap: testing, disregard [16:42:19] !log dancy@deploy1002 dancy: testing, disregard synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [16:42:25] !log dancy@deploy1002 Sync cancelled. [16:42:27] jbond: All is well. Thanks again. 
[16:44:27] dancy: np [16:50:21] (Abandoned) Ebernhardson: sre.wdqs.data-reload: puppet should stay disabled during data reload [cookbooks] - https://gerrit.wikimedia.org/r/833422 (owner: Ebernhardson) [16:50:28] (PS2) Ebernhardson: sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423 [17:06:46] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:08] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:30] (CR) Phuedx: [C: +1] [WIP] Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: Awight) [17:10:30] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:11] (PS1) Vlad.shapik: Add configurations to get an HTML test coverage report [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [17:15:16] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:38:49] (PS1) TChin: Bump eventgate-* image versions [deployment-charts] - https://gerrit.wikimedia.org/r/833428 (https://phabricator.wikimedia.org/T313202) [17:41:45] SRE, Analytics-Radar, Domains, Traffic-Icebox, WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - 
https://phabricator.wikimedia.org/T262996 (BCornwall) I think this ticket could do with some more clarification: Whi... [17:45:20] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:46] SRE, Discovery-Search, serviceops, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Gehel) [17:55:47] jouncebot now [17:55:47] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [17:55:52] jouncebot nowandnext [17:55:52] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [17:55:52] In 0 hour(s) and 4 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T1800) [17:56:30] I'll roll the train forward about 20-30 minutes into the upcoming train window. Need to go afk for a bit. [17:57:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [17:59:54] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:10:55] (PS1) Jdlrobson: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) [18:11:05] (CR) CI reject: [V: -1] Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [18:16:46] (CR) Ottomata: [C: +2] Bump eventgate-* image versions [deployment-charts] - https://gerrit.wikimedia.org/r/833428 (https://phabricator.wikimedia.org/T313202) (owner: TChin) 
[18:18:19] (PS2) Jdlrobson: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) [18:19:36] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:20:26] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:21:32] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:22:36] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:22:49] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:23:38] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:26:25] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [18:27:03] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [18:27:15] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [18:28:17] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [18:28:28] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:29:21] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:30:47] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:31:12] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:31:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:33] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 
(https://phabricator.wikimedia.org/T314191) [18:31:35] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot) [18:31:59] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [18:32:17] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833438 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot) [18:32:55] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [18:33:02] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:33:47] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:36:48] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.2 refs T314191 [18:36:52] T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191 [18:37:45] (PS2) Jcrespo: dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062) [18:37:48] (PS1) Jcrespo: dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - https://gerrit.wikimedia.org/r/833439 [18:39:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:40:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:40:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:40:37] (PS2) Jcrespo: dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - https://gerrit.wikimedia.org/r/833439 [18:42:12] (CR) Jcrespo: [C: +2] dbbackups: Reduce memory consumption of db2100 (s7) [puppet] - 
https://gerrit.wikimedia.org/r/833439 (owner: Jcrespo) [18:42:40] (CR) Gehel: [C: +2] sre.wdqs.data-reload: Add option to reuse munge [cookbooks] - https://gerrit.wikimedia.org/r/833423 (owner: Ebernhardson) [18:44:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:45:43] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:46:07] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:46:26] (PS1) Jcrespo: dbbackups: Fix memory consumption of db2100 (s7) - followup [puppet] - https://gerrit.wikimedia.org/r/833440 [18:46:36] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [18:46:44] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:17] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [18:47:26] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [18:47:32] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [18:47:34] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:48:06] (CR) Jcrespo: [C: +2] dbbackups: Fix memory consumption of db2100 (s7) - followup [puppet] - https://gerrit.wikimedia.org/r/833440 (owner: Jcrespo) [18:48:25] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [18:50:39] !log restart db2100:s7 to apply new config [18:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] (PS1) Bking: wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441 [18:53:20] (PS3) Jcrespo: dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062) 
[18:53:52] (CR) Gehel: [C: +2] wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441 (owner: Bking) [18:54:38] (PS1) DLynch: Register the editattempt_block schema [mediawiki-config] - https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) [18:55:04] (CR) Jcrespo: [C: +2] dbbackups: Reenable notifications on db2100 after adding s7 [puppet] - https://gerrit.wikimedia.org/r/833124 (https://phabricator.wikimedia.org/T318062) (owner: Jcrespo) [18:58:19] (Merged) jenkins-bot: wdqs data-reload: fix arguments [cookbooks] - https://gerrit.wikimedia.org/r/833441 (owner: Bking) [19:05:11] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [19:07:27] (PS5) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) [19:07:35] (PS6) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) [19:11:49] (CR) Samtar: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar) [19:27:05] (CR) Ottomata: [C: +2] charts:eventstreams bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena) [19:31:07] (Merged) jenkins-bot: charts:eventstreams bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena) [19:53:00] (CR) Ahmon Dancy: [C: +1] "Looks reasonable to me. 
Analogous to https://gerrit.wikimedia.org/r/c/operations/puppet/+/810146" [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar) [19:53:41] (CR) Ahmon Dancy: [C: +1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37300/" [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar) [19:54:37] (PS1) Gmodena: Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390) [19:55:11] (CR) Ottomata: [C: +2] Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena) [19:56:30] (PS3) Samtar: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [19:56:40] !log mforns@deploy1002 Started deploy [analytics/refinery@62d8262]: Regular analytics weekly train [analytics/refinery@62d8262] [19:58:59] (Merged) jenkins-bot: Bump eventstreams chart version. [deployment-charts] - https://gerrit.wikimedia.org/r/833447 (https://phabricator.wikimedia.org/T292390) (owner: Gmodena) [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220920T2000). Please do the needful. [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] evening [20:00:57] * TheresNoTime can deploy! 
[20:02:00] hi TheresNoTime -- I can deploy Jon's patch -- and since it's the only one in the queue, it will be quick [20:02:07] RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:13] cjming: go for it :D [20:02:30] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [20:02:33] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [20:02:34] will do - thanks for being willing! [20:03:03] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [20:04:10] (Merged) jenkins-bot: Enable Nearby everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/833435 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [20:04:38] !log cjming@deploy1002 Started scap: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]] [20:04:40] !log mforns@deploy1002 Finished deploy [analytics/refinery@62d8262]: Regular analytics weekly train [analytics/refinery@62d8262] (duration: 08m 00s) [20:04:42] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:05:03] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:05:23] !log mforns@deploy1002 Started deploy [analytics/refinery@62d8262] (thin): Regular analytics weekly train THIN [analytics/refinery@62d8262] [20:05:30] !log mforns@deploy1002 Finished deploy [analytics/refinery@62d8262] (thin): Regular analytics weekly train THIN [analytics/refinery@62d8262] (duration: 00m 07s) [20:09:09] PROBLEM - Check systemd state on 
cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:41] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:833435|Enable Nearby everywhere (T246493)]] (duration: 09m 02s) [20:13:44] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:15:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:39] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:19:21] shutting it down early today [20:19:24] !log end of UTC late backport window [20:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:30] SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) @akosiaris, good news, Gabriele is working on this!!! @Jelto @JMeybohm, it seems the upgr... 
[20:27:08] SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 01), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) BTW, since we merged the helm chart changes, eventstreams is currently undeployable. We r... [20:30:51] (CR) AOkoth: [C: +1] gitlab: reduce backup_keep_time to 1d [puppet] - https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [20:59:23] (Abandoned) Brennen Bearnes: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes) [21:00:22] (Abandoned) Brennen Bearnes: scap: separate new rev perms from old rev perm cleanup [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes) [21:03:08] (PS1) Sbailey: Enable Linter write of namespace tag and template fields on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) [21:05:36] (PS1) Ahmon Dancy: scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) [21:06:47] (PS2) Ahmon Dancy: scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) [21:15:49] SRE, Data-Engineering, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (Milimetric) This is a nasty bug if Andrew happens to not be around, I just wanna ++ the tech debt val... 
[21:28:55] Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall) [21:29:02] Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall) a:BCornwall [21:38:13] PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 855221 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops [22:04:38] (PS1) Zabe: Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) [22:28:07] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:15] (PS1) Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) [22:30:48] (CR) CI reject: [V: -1] cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson) [22:51:37] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Reopened ticket With dell