[00:01:38] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:19:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:19:36] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:09:24] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:09:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [01:19:18] PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 640 MB (3% inode=91%): /tmp 640 MB (3% inode=91%): /var/tmp 640 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [01:20:30] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [01:49:34] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T321254 (10phaultfinder) [01:55:34] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:14] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:34:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [03:39:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [03:49:38] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:57:34] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:33:30] (03CR) 10AntiCompositeNumber: "This patch should be abandoned." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [04:33:49] (03CR) 10Krinkle: [C: 04-1] "I had this dangling in local branch for a while, I guess things changed since then or I misread. I'll take another look." 
[puppet] - 10https://gerrit.wikimedia.org/r/842934 (owner: 10Krinkle) [04:50:38] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:18:44] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:24:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:31:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab2002), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:34:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:52:36] RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [06:00:04] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T0600). Please do the needful. 
[06:01:30] (03PS1) 10KartikMistry: testwiki: Enable Section Translation in 15 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844861 (https://phabricator.wikimedia.org/T319175) [06:19:36] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:36:12] * kart_ updating cxserver [06:36:25] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-10-18-161640-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T317224) (owner: 10KartikMistry) [06:39:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:40:24] (03Merged) 10jenkins-bot: Update cxserver to 2022-10-18-161640-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T317224) (owner: 10KartikMistry) [06:41:51] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:42:22] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:43:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] P:lvs::configuration: Store all site data in an accessible structure [puppet] - 10https://gerrit.wikimedia.org/r/844458 (owner: 10Jbond) [06:47:38] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:48:38] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:51:19] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:52:15] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:53:24] !log Updated 
cxserver to 2022-10-18-161640-production (T317224, T319175, T319176) [06:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:32] T319175: Enable Content and Section translation on 6 more Wikipedias - https://phabricator.wikimedia.org/T319175 [06:53:32] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [06:53:32] T317224: Enable Arabic MT support from Google to Egyptian Arabic Wikipedia - https://phabricator.wikimedia.org/T317224 [06:59:20] (03CR) 10Joal: Put fsimage backup on hdfs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [07:00:04] Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T0700). Please do the needful. [07:00:04] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:12] good morning! [07:00:21] Good Morning! [07:00:35] we have one trainee signed up for the window [07:01:14] while there is only one patch listed (yours), the trainee intends to deploy theirs as well, one that was reverted during a previous attempt. [07:01:43] No worries. I'll go ahead with my patch first if that's fine. [07:01:52] yes, I was about to suggest going on ahead [07:02:31] (03PS5) 10Giuseppe Lavagetto: maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [07:02:31] Cool. Thanks! 
[07:03:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844861 (https://phabricator.wikimedia.org/T319175) (owner: 10KartikMistry) [07:03:58] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation in 15 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844861 (https://phabricator.wikimedia.org/T319175) (owner: 10KartikMistry) [07:04:28] !log kartik@deploy1002 Started scap: Backport for [[gerrit:844861|testwiki: Enable Section Translation in 15 Wikipedias (T319175 T319176)]] [07:04:34] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [07:04:34] T319175: Enable Content and Section translation on 6 more Wikipedias - https://phabricator.wikimedia.org/T319175 [07:04:55] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:844861|testwiki: Enable Section Translation in 15 Wikipedias (T319175 T319176)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [07:08:37] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37660/console" [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [07:11:09] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:844861|testwiki: Enable Section Translation in 15 Wikipedias (T319175 T319176)]] (duration: 06m 41s) [07:11:15] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [07:11:15] T319175: Enable Content and Section translation on 6 more Wikipedias - https://phabricator.wikimedia.org/T319175 [07:11:34] (03PS2) 10Elukey: coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 
(https://phabricator.wikimedia.org/T321159) [07:13:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] maintenance::wikidata: Update cron with lb and lb-pool [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [07:19:45] apergos: Oops. I forgot to ping you back. I'm done with deployment. [07:19:56] heh I thought that might be the case. [07:20:08] still waiting on our trainee to show up, so leaving the window open. [07:20:12] (Was updating tasks :/) [07:20:18] OK! [07:49:13] our trainee has not shown up; something must have come up. I'll ping him on the task and we'll reschedule. in the meantime, that's it for the window for today, folks! [07:49:23] !log UTC morning backport and config training window closed [07:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:33] see everyone next time! [07:52:40] !log +40 to k8s-mlserve on prometheus codfw [07:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] hashar and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T0800). nyaa~ [08:00:21] ... 
[08:00:25] (03PS1) 10Filippo Giunchedi: Assign graphite role to new graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/844913 (https://phabricator.wikimedia.org/T318903) [08:01:25] I'm seeking a kind soul for a +1 on ^ [08:02:34] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844914 (https://phabricator.wikimedia.org/T320511) [08:02:36] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844914 (https://phabricator.wikimedia.org/T320511) (owner: 10TrainBranchBot) [08:03:22] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844914 (https://phabricator.wikimedia.org/T320511) (owner: 10TrainBranchBot) [08:05:14] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:28] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.6 refs T320511 [08:07:33] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511 [08:13:54] (03CR) 10Filippo Giunchedi: [C: 03+2] Assign graphite role to new graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/844913 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [08:14:00] (03PS2) 10Filippo Giunchedi: Assign graphite role to new graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/844913 (https://phabricator.wikimedia.org/T318903) [08:26:33] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) >>! In T320786#8330568, @Jclark-ctr wrote: > Drive will arrive tomorrow. Can it be swapped when it arrives or will it need to be scheduled? Yes, host is currently depooled and not... 
[08:31:43] (03PS1) 10Filippo Giunchedi: graphite: extract graphite_hosts into hiera [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) [08:34:09] (03CR) 10CI reject: [V: 04-1] graphite: extract graphite_hosts into hiera [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [08:35:18] !log re-enabling Arelion on cr1-drmrs - T321157 [08:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:15] (03PS2) 10Filippo Giunchedi: graphite: extract graphite_hosts into hiera [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) [08:37:06] _joe_: ^ FYI [08:37:31] <_joe_> XioNoX: ack, I'll tell you if things degrade :P [08:47:34] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:47:58] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/pcc-worker1002/37661/" [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [08:48:05] (03PS3) 10Filippo Giunchedi: graphite: extract graphite_hosts into hiera [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) [08:49:08] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:50:33] (03PS4) 10Filippo Giunchedi: graphite: extract graphite_hosts into hiera [puppet] - 10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) [08:53:29] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: extract graphite_hosts into hiera [puppet] - 
10https://gerrit.wikimedia.org/r/844920 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [08:57:46] (03CR) 10Hashar: [C: 04-1] scap: automatize plugins handling (032 comments) [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [08:57:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [09:05:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [09:06:17] (03CR) 10Aqu: [V: 03+1] Put fsimage backup on hdfs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [09:06:31] (03Abandoned) 10Aqu: Put fsimage backup on hdfs [puppet] - 10https://gerrit.wikimedia.org/r/844471 (https://phabricator.wikimedia.org/T321167) (owner: 10Aqu) [09:17:22] (03PS10) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 [09:18:05] (03CR) 10CI reject: [V: 04-1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [09:28:52] (03PS3) 10Jelto: gitlab_runner: make allowed_images list configurable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) [09:35:06] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37662/console" [puppet] - 10https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto) [09:39:09] (03PS4) 10Stang: Fix broken wordmarks in Bengali projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) [09:39:28] (03PS5) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [09:40:29] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: make allowed_images list configurable in hiera [puppet] - 10https://gerrit.wikimedia.org/r/844434 (https://phabricator.wikimedia.org/T320730) (owner: 10Jelto) [09:42:12] (03PS11) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 [09:42:51] (03CR) 10CI reject: [V: 04-1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [09:45:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1239 [09:46:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1239 [09:46:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2516 [09:47:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2516 [09:47:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2518 [09:48:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2518 [09:48:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2828 [09:48:37] (03CR) 10Jcrespo: [C: 03+1] "The backups part looks ok. 
Aside from the module postgres::backups, that may change in the future (but hopefully may not affect you), the " [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [09:49:12] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 2828 [09:49:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3605 [09:49:18] 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10Aklapper) [09:50:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3605 [09:50:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6327 [09:50:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6327 [09:50:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9269 [09:51:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9269 [09:51:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12200 [09:52:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12200 [09:52:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16265 [09:53:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16265 [09:53:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16591 [09:54:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16591 [09:54:21] !log ayounsi@cumin1001 START - Cookbook 
sre.network.peering with action 'email' for AS: 24429 [09:55:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 24429 [09:55:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36012 [09:56:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36012 [09:56:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58453 [09:56:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58453 [09:56:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 63949 [09:57:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63949 [10:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1000). [10:00:22] (03PS5) 10Stang: Fix broken wordmarks/taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T320944) [10:03:44] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:07:13] (03CR) 10Jbond: "LGTM from a puppet PoV" [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [10:08:47] 10SRE, 10Infrastructure-Foundations, 10netops: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi) > I highlighted some noticeable SV1 peers as well in T280202#7766440 so we should reach out to them. 
14 peering requests sent to those noticeable in SV1 but not in SV8 networks [10:10:48] (03PS12) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 [10:12:17] (03PS1) 10Jbond: hiera: add default for profile::docker::builder::known_uid_mappings [puppet] - 10https://gerrit.wikimedia.org/r/844928 [10:12:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: add default for profile::docker::builder::known_uid_mappings [puppet] - 10https://gerrit.wikimedia.org/r/844928 (owner: 10Jbond) [10:13:46] (03PS18) 10Btullis: Add postgresql to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) [10:15:04] (03PS2) 10Matthias Mullie: Fix value for wgQuickViewMediaRepositorySearchUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485 [10:15:57] (03CR) 10Clément Goubert: [C: 03+1] "LGTM and should be pretty useful :)" [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [10:25:00] (03PS1) 10Ayounsi: ulsfo: renumber HE BGP session to sv1 [homer/public] - 10https://gerrit.wikimedia.org/r/844929 (https://phabricator.wikimedia.org/T321193) [10:26:40] (03CR) 10Ayounsi: [C: 03+2] ulsfo: renumber HE BGP session to sv1 [homer/public] - 10https://gerrit.wikimedia.org/r/844929 (https://phabricator.wikimedia.org/T321193) (owner: 10Ayounsi) [10:26:42] (03PS2) 10Jbond: P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 [10:26:46] (03CR) 10Jbond: [C: 03+2] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [10:29:49] (03Merged) 10jenkins-bot: ulsfo: renumber HE BGP session to sv1 [homer/public] - 10https://gerrit.wikimedia.org/r/844929 (https://phabricator.wikimedia.org/T321193) (owner: 10Ayounsi) [10:30:44] (03PS3) 10Jelto: Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 
10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [10:34:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Appledora) [10:37:35] (03CR) 10Jelto: [C: 03+2] "rebased to I112110d2553a41e839f9990c39ac2a872135c588" [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [10:39:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:42:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: Add usernames for mw-debug k8s service [puppet] - 10https://gerrit.wikimedia.org/r/844491 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [10:44:25] (03PS3) 10Jelto: gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes) [10:46:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You can just limit yourself to the psp change, we don't really need additional resources here. 
2/3 pods should be more than enough per dat" [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [10:48:07] (03CR) 10Jelto: [C: 03+2] "rebased to I112110d2553a41e839f9990c39ac2a872135c588" [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes) [10:48:38] (03CR) 10Majavah: [C: 03+2] mono68: Remove expired DST Root CA X3 cert (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466) (owner: 10BryanDavis) [10:49:27] (03Merged) 10jenkins-bot: mono68: Remove expired DST Root CA X3 cert [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/844512 (https://phabricator.wikimedia.org/T311466) (owner: 10BryanDavis) [11:05:00] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Ladsgroup) Yeah I will take care of it, like moving out of replication and gracefully shutting it down. Just ping me before you want to start and I get it done ASAP (similar to the other on... 
[11:08:15] (03PS2) 10Clément Goubert: admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) [11:12:50] (03CR) 10Clément Goubert: admin: Add mw-debug namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:16:04] (03CR) 10Clément Goubert: [C: 03+2] hieradata: Add usernames for mw-debug k8s service [puppet] - 10https://gerrit.wikimedia.org/r/844491 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:16:23] !log upload new pypuppetdb package [11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:06] hi all after a suggestion from cla.ime on friday i have created a small tool to query puppetdb for a list of changed resources grouped per host in a given time window. could be useful for (post) incident analysis, debugging or other... comments suggestions welcome https://wikitech.wikimedia.org/wiki/Puppet#pdb-changes_-_Audit_changes_during_a_specific_time_window [11:26:41] (03CR) 10Majavah: P:toolforge: use puppetdb for grid hba data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [11:30:28] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) @Ladsgroup FYI disk replacements on dbs are online. It may have some hit on io performance, but most of the point of HW RAIDs is online disk operations - rebuilds are a hot operati... 
[11:39:36] (03CR) 10Jbond: [C: 04-1] phabricator: use anchor/alias to add phab servers to dump clients list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [11:45:23] 10SRE, 10Wiki Loves Monuments FY 2022-2023, 10Wikimedia-Mailing-lists: Mailing list for WLM-int jury - https://phabricator.wikimedia.org/T321271 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}} [11:46:47] 10SRE, 10Wiki Loves Monuments FY 2022-2023, 10Wikimedia-Mailing-lists: Mailing list for WLM-int jury - https://phabricator.wikimedia.org/T321271 (10Ciell) Thanks! [11:47:47] !log roll update for libksba [11:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:54:27] (03PS1) 10Jbond: admin: extend contract for Sammy Tarling [puppet] - 10https://gerrit.wikimedia.org/r/844962 [11:54:45] (03CR) 10Jbond: [C: 03+2] admin: extend contract for Sammy Tarling [puppet] - 10https://gerrit.wikimedia.org/r/844962 (owner: 10Jbond) [11:55:28] (03PS1) 10Hokwelum: Change wm_enterprise_settings file permission [puppet] - 10https://gerrit.wikimedia.org/r/844963 [11:55:30] (03CR) 10Clément Goubert: [C: 03+2] admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [11:57:35] (03CR) 10Ladsgroup: [C: 03+1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [11:59:24] (03Merged) 10jenkins-bot: admin: Add mw-debug namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/844488 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [12:01:48] !log cgoubert@deploy1002 helmfile [staging-codfw] START 
helmfile.d/admin 'apply'. [12:04:19] !log Deploying new mw-debug namespace [12:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:14] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:05:28] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:05:48] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:05:50] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:06:34] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:06:55] (03CR) 10Btullis: [C: 03+2] Add postgresql to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/843502 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [12:07:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:08:04] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:08:27] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:08:48] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:08:50] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:21:52] 10SRE, 10Data-Engineering-Planning, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10LSobanski) [12:25:19] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 1280 [12:25:21] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1280 [12:25:22] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 3856 [12:26:45] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3856 [12:26:46] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 4648 [12:27:25] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4648 [12:27:26] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 4766 [12:28:25] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4766 [12:28:26] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 4780 [12:28:27] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4780 [12:28:28] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 5650 [12:28:46] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5650 
[12:28:47] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 7091 [12:28:47] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7091 [12:28:48] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 7575 [12:28:57] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7575 [12:28:58] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 7713 [12:30:20] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7713 [12:30:21] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 8674 [12:31:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8674 [12:31:03] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 9498 [12:31:30] 10SRE, 10Znuny, 10serviceops-collab: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10LSobanski) [12:31:54] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9498 [12:31:55] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 9505 [12:32:13] (03CR) 10Giuseppe Lavagetto: Introduce the ClusterConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [12:33:01] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9505 [12:33:02] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 26803 [12:33:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26803 [12:34:20] 10SRE, 10Infrastructure-Foundations, 10Znuny, 
10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) [12:35:31] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 42 [12:36:25] (03PS1) 10Jbond: admin: jbond user files [puppet] - 10https://gerrit.wikimedia.org/r/844970 [12:37:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42 [12:37:02] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 714 [12:39:35] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 714 [12:39:36] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 2152 [12:39:50] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2152 [12:39:51] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 2647 [12:41:09] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 2647 [12:41:10] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 3292 [12:42:13] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3292 [12:42:14] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 4637 [12:43:14] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4637 [12:43:15] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 8966 [12:44:24] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8966 [12:44:25] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 10310 [12:46:53] !log ayounsi@cumin2002 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'email' for AS: 10310 [12:46:54] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 11164 [12:47:29] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11164 [12:47:30] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 11404 [12:48:05] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11404 [12:48:05] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 13335 [12:48:35] (03PS6) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [12:50:28] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 13335 [12:50:29] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 14061 [12:52:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14061 [12:52:11] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 15169 [12:52:22] (03PS1) 10Urbanecm: DataTableCellMentee: Strike-through suppressed mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844043 (https://phabricator.wikimedia.org/T319185) [12:53:15] (03CR) 10Urbanecm: [C: 03+2] "will be deployed during upcoming B&C" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844043 (https://phabricator.wikimedia.org/T319185) (owner: 10Urbanecm) [12:55:14] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 15169 [12:55:15] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 16276 [12:58:23] !log 
ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16276 [12:58:24] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 16509 [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1300). [13:00:05] koi, arlolra, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] I can deploy today! [13:00:19] o [13:00:27] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 16509 [13:00:28] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 19151 [13:00:31] good, I can't :) [13:00:44] * urbanecm waves to Lucas_WMDE anyway :) [13:01:41] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19151 [13:01:42] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 29791 [13:02:53] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29791 [13:02:54] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 32787 [13:04:01] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32787 [13:04:01] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 36351 [13:04:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T320944) (owner: 10Stang) [13:05:01] koi: let's start with your patches :) [13:05:33] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36351 [13:05:34] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 36692 [13:06:11] (03Merged) 10jenkins-bot: Fix broken wordmarks/taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T320944) (owner: 10Stang) [13:06:25] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844054|Fix broken wordmarks/taglines (T320944 T321124 T321258)]] [13:06:32] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36692 [13:06:32] T320944: Taglline of German Wikipedia on vector-2022 broken - https://phabricator.wikimedia.org/T320944 [13:06:32] T321124: Fix Bengali wordmarks & taglines - https://phabricator.wikimedia.org/T321124 [13:06:32] T321258: Mongolian Wikipedia vector 2022 logo letters misaligned - https://phabricator.wikimedia.org/T321258 [13:06:33] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 45102 [13:06:45] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:844054|Fix broken wordmarks/taglines (T320944 T321124 T321258)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:06:55] koi: your first patch's at mwdebug1001, please test [13:07:06] looking [13:07:30] hi arlolra :) [13:07:43] hello. 
I am here if it's not too late [13:07:46] (03PS6) 10Urbanecm: zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:07:58] urbanecm: I tested logos for Bengali projects and mnwiki, this patch fix those problems, so LGTM [13:07:58] (03PS7) 10Urbanecm: zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:08:02] (03CR) 10Urbanecm: [C: 03+2] zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:08:05] (03CR) 10Urbanecm: [C: 03+2] zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:08:09] PROBLEM - Host db1168.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:09] PROBLEM - Host db1169.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:12] koi: great, syncing [13:08:15] PROBLEM - Host db1181.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:21] arlolra: it's ok. i'll ping you once i get to your patch! 
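Editor's sketch: the sre.network.peering runs are logged as START/END pairs, and a few of them ended with FAIL (exit_code=99). Assuming the message layout seen in this log, a small parser can pull the failed runs out of the stream. The regular expression and the function name `failed_runs` are illustrative assumptions, not part of spicerack or the cookbook itself.

```python
import re

# Matches "END (PASS|FAIL)" cookbook messages as they appear in this log.
# The trailing "with action ... for AS: N" part is optional, because other
# cookbooks (e.g. sre.postgresql.postgres-init) do not log it.
END_MESSAGE = re.compile(
    r"\[(\d\d:\d\d:\d\d)\] !log \S+ END \((PASS|FAIL)\) - Cookbook (\S+) "
    r"\(exit_code=(\d+)\)(?: with action '\w+' for AS: (\d+))?"
)


def failed_runs(log_text):
    """Return (time, cookbook, exit_code, asn) for every failed run."""
    failures = []
    for match in END_MESSAGE.finditer(log_text):
        time, status, cookbook, exit_code, asn = match.groups()
        if status == "FAIL":
            failures.append((time, cookbook, int(exit_code), asn))
    return failures


# Two messages copied from this log, joined on one line as in the log itself.
SAMPLE = (
    "[12:48:05] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering "
    "(exit_code=0) with action 'email' for AS: 11404 "
    "[12:50:28] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.peering "
    "(exit_code=99) with action 'email' for AS: 13335"
)

if __name__ == "__main__":
    for failure in failed_runs(SAMPLE):
        print(failure)
```

`finditer` is used instead of line-by-line matching because several messages share one physical line in this log.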
[13:08:31] PROBLEM - Host aqs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:39] PROBLEM - Host aqs1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:43] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45102 [13:08:43] it might be kind of complex to purge all those wordmarks/taglines though 0 0 [13:08:44] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 132203 [13:08:53] PROBLEM - Host dbproxy1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:08:59] PROBLEM - Host dbproxy1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:03] PROBLEM - Host dbproxy1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:09] PROBLEM - Host dbproxy1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:12] (03Merged) 10jenkins-bot: zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:09:13] PROBLEM - Host cloudcontrol1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:13] PROBLEM - Host cloudmetrics1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:15] (03Merged) 10jenkins-bot: zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:09:18] koi: it's fine, that's my/deployer's problem :). `git diff HEAD~ HEAD | grep -- '--- a/.*\.svg' | sed 's#^--- a#https://en.wikipedia.org#g'` generates the list fairly easily. 
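Editor's sketch: the one-liner at 13:09:18 turns the `--- a/...svg` header lines of `git diff HEAD~ HEAD` into URLs to purge. The same transformation can be written as a small function, which is easier to test. The sample diff and its file names are made up, and `purge_urls` is a hypothetical name; only the `--- a` prefix rewrite and the `https://en.wikipedia.org` base come from the message.

```python
import re

# "Old file" header of a unified diff that names an SVG, e.g.
# "--- a/static/images/foo.svg". Files added by the commit show
# "--- /dev/null" instead and are skipped, as in the shell pipeline.
SVG_HEADER = re.compile(r"^--- a(/.*\.svg)\s*$")


def purge_urls(diff_text, base="https://en.wikipedia.org"):
    """Return one URL per modified or deleted SVG, in diff order."""
    urls = []
    for line in diff_text.splitlines():
        match = SVG_HEADER.match(line)
        if match:
            urls.append(base + match.group(1))
    return urls


# Hypothetical diff output; real input would be `git diff HEAD~ HEAD`.
SAMPLE_DIFF = """\
diff --git a/static/images/example-wordmark.svg b/static/images/example-wordmark.svg
--- a/static/images/example-wordmark.svg
+++ b/static/images/example-wordmark.svg
@@ -1 +1 @@
-<svg/>
+<svg></svg>
diff --git a/logos.php b/logos.php
--- a/logos.php
+++ b/logos.php
"""

if __name__ == "__main__":
    print("\n".join(purge_urls(SAMPLE_DIFF)))
```

Unlike the grep pattern, this version anchors `.svg` at the end of the line, so a path that merely contains `.svg` somewhere in the middle is not picked up.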
[13:09:19] PROBLEM - Host ps1-c5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:09:53] PROBLEM - Host parse1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:09:57] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132203 [13:09:58] !log ayounsi@cumin2002 START - Cookbook sre.network.peering with action 'email' for AS: 199524 [13:09:59] PROBLEM - Host ganeti1010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:27] PROBLEM - Host an-conf1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:27] PROBLEM - Host an-db1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:27] topranks, godog: it seems that ps1-c5-eqiad went down (see above alerts) [13:10:31] PROBLEM - Host an-test-worker1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:33] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10thcipriani) >>! In T318983#8315608, @Dzahn wrote: > Hi @thcipriani This would need your approval. 👋 Approved! @eigyan is interested in deploying and mwlog host access is a good place... 
[13:10:43] PROBLEM - Host db1120.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:10:43] PROBLEM - Host db1189.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:03] PROBLEM - Host ganeti1024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:07] PROBLEM - Host es1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:20] volans: not the PDU [13:11:23] PROBLEM - Host gitlab-runner1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:25] volans: probably the ToR switch [13:11:41] er ToR management switch I mean [13:11:43] PROBLEM - Host mw1485.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:43] PROBLEM - Host mw1482.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:43] PROBLEM - Host mw1484.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:43] PROBLEM - Host mw1483.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:43] PROBLEM - Host mw1486.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:43] PROBLEM - Host kubernetes1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:47] msw-c5-eqiad [13:11:51] yeah sorry, I meant that, but copy-pasted wrongly [13:11:54] (03PS2) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) [13:11:59] PROBLEM - Host wdqs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:11:59] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 [13:11:59] PROBLEM - Host parse1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:12:09] PROBLEM - Host parse1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:12:15] PROBLEM - Host puppetmaster1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:12:18] (03Abandoned) 10Hokwelum: Change wm_enterprise_settings file permission [puppet] - 10https://gerrit.wikimedia.org/r/844963 (owner: 10Hokwelum) [13:12:25] PROBLEM - Host wcqs1003.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [13:12:28] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844054|Fix broken wordmarks/taglines (T320944 T321124 T321258)]] (duration: 06m 03s) [13:12:33] PROBLEM - Host cloudmetrics1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:12:34] T320944: Taglline of German Wikipedia on vector-2022 broken - https://phabricator.wikimedia.org/T320944 [13:12:35] T321124: Fix Bengali wordmarks & taglines - https://phabricator.wikimedia.org/T321124 [13:12:35] T321258: Mongolian Wikipedia vector 2022 logo letters misaligned - https://phabricator.wikimedia.org/T321258 [13:12:36] koi: first patch's live [13:12:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:12:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [13:12:51] gah, thanks for heads up [13:12:56] (03CR) 10CI reject: [V: 04-1] DataTableCellMentee: Strike-through suppressed mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844043 (https://phabricator.wikimedia.org/T319185) (owner: 10Urbanecm) [13:12:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:842918|zhwiki: Add 20 years logos (T320859)]], [[gerrit:842919|zhwiki: Update 20 years logos in logos.php and IS.php (T320859)]] [13:13:02] T320859: Requesting temporary logo change for zh.wikipedia.org - https://phabricator.wikimedia.org/T320859 [13:13:13] I'm in a meeting, doesn't look like we're immediately on fire? 
cc topranks [13:13:17] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:842918|zhwiki: Add 20 years logos (T320859)]], [[gerrit:842919|zhwiki: Update 20 years logos in logos.php and IS.php (T320859)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:13:21] PROBLEM - Host pc1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:13:35] godog: nah it's mostly noise and a blocker for re-image cookbooks and the like [13:13:37] koi: second patch's at mwdebug1001, please check [13:13:47] PROBLEM - Host snapshot1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:13:58] XioNoX: ack, please highlight or page when help is needed [13:14:00] * topranks looking [13:14:05] PROBLEM - Host db1145.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:14:05] PROBLEM - Host db1146.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:14:08] hi urbanecm, have you done the purge of the first patch? I found those wordmark/taglines unchange w/o mwdebug1001 [13:14:27] godog, topranks, I'll let oncall deal with it but ping me if you need help [13:14:37] koi: thanks for the reminder. "Purging 108 urls Done!" :) [13:14:37] done now [13:14:40] XioNoX: sure thanks! [13:14:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [13:15:59] urbanecm: thanks! 
The second patch got tested under vector-2010, vector-2022 and timeless, all looks fine [13:16:07] syncing [13:16:11] (03Merged) 10jenkins-bot: DataTableCellMentee: Strike-through suppressed mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844043 (https://phabricator.wikimedia.org/T319185) (owner: 10Urbanecm) [13:16:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 12200 [13:17:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) p:05Triage→03Medium [13:17:52] topranks, godog: if you need a quick way to get the list of hosts you can use either netbox ( https://netbox.wikimedia.org/dcim/racks/21/ ) or cumin ('R:motd::message%message ~ ".*eqiad and rack C5"') [13:18:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12200 [13:18:24] volans: thanks! yeah netbox would be my go-to, but I like the look of that cumin command I will give it a whirl [13:18:34] fwiw confirmed switch is down, port hard down on msw1-eqiad [13:18:43] I'll open a task shortly and ack the alerts [13:19:38] scap shows this warning [13:19:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:842918|zhwiki: Add 20 years logos (T320859)]], [[gerrit:842919|zhwiki: Update 20 years logos in logos.php and IS.php (T320859)]] (duration: 06m 57s) [13:20:00] T320859: Requesting temporary logo change for zh.wikipedia.org - https://phabricator.wikimedia.org/T320859 [13:20:24] https://www.irccloud.com/pastebin/Agm7uCOJ/ [13:20:43] !log btullis@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:21:01] !log btullis@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [13:21:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844043 
(https://phabricator.wikimedia.org/T319185) (owner: 10Urbanecm) [13:21:43] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844043|DataTableCellMentee: Strike-through suppressed mentees (T319185)]] [13:21:48] T319185: Suppressed accounts should be removed from Mentor dashboard - https://phabricator.wikimedia.org/T319185 [13:22:03] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:844043|DataTableCellMentee: Strike-through suppressed mentees (T319185)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:22:35] hi urbanecm, there's some problem with the zhwiki logo patch [13:22:57] https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5?useskin=timeless [13:23:10] koi: can you be more specific please? [13:23:15] !log btullis@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:23:17] the file wikipedia-zh-20.svg does not exist [13:23:21] https://zh.wikipedia.org/static/images/icons/wikipedia-zh-20.svg [13:23:22] !log btullis@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [13:23:33] it does exist on my side [13:24:11] koi: can it be clientside cache? [13:24:19] (ie. what does ctrl+shift+r do?) [13:24:41] oh it exist now, seems problem of my side :( [13:25:39] great! 
[13:25:57] (03CR) 10Urbanecm: [C: 03+2] Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:26:41] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:27:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844043|DataTableCellMentee: Strike-through suppressed mentees (T319185)]] (duration: 05m 18s) [13:27:08] T319185: Suppressed accounts should be removed from Mentor dashboard - https://phabricator.wikimedia.org/T319185 [13:27:29] !log btullis@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:28:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [13:28:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:28:59] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844074|Disable wgParserEnableLegacyMediaDOM on viwiki (T314318)]] [13:29:04] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [13:29:19] !log urbanecm@deploy1002 urbanecm and arlolra: Backport for [[gerrit:844074|Disable wgParserEnableLegacyMediaDOM on viwiki (T314318)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:29:32] arlolra: hi, your patch is at mwdebug1001. can you test it? 
[13:29:39] sure [13:29:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Isaac) [13:30:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Isaac) ^^ Just unchecking the SSH key box so SRE can do that. [13:30:13] RECOVERY - Host ganeti1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [13:30:14] RECOVERY - Host ganeti1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [13:30:17] RECOVERY - Host wcqs1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [13:30:25] RECOVERY - Host cloudmetrics1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms [13:30:49] RECOVERY - Host pc1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [13:31:09] RECOVERY - Host snapshot1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.03 ms [13:31:13] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10cmooney) p:05Triage→03High [13:31:43] RECOVERY - Host db1169.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms [13:31:46] urbanecm: lgtm [13:31:52] thanks, syncing [13:32:13] RECOVERY - Host dbproxy1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms [13:32:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 7843 [13:32:39] RECOVERY - Host dbproxy1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [13:32:41] RECOVERY - Host dbproxy1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [13:32:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7843 [13:33:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20115 [13:33:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20115 [13:34:11] RECOVERY - Host 
an-db1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms [13:34:11] RECOVERY - Host an-conf1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms [13:34:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 1239 [13:34:28] I'm scratching my head here, msw-c5-eqiad's been hard down since 13:05, not sure why we are seeing those recoveries [13:35:11] is it back up? [13:35:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844074|Disable wgParserEnableLegacyMediaDOM on viwiki (T314318)]] (duration: 06m 59s) [13:36:03] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [13:36:05] arlolra: should be all lie! [13:36:07] *live [13:36:12] anything else? [13:36:23] (03PS1) 10Ssingh: aptrepo: add thirdparty/haproxy24 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/844983 (https://phabricator.wikimedia.org/T321309) [13:36:23] thank you! [13:36:31] nothing from me [13:37:04] volans: no, neither has the port bounced up/down at any point looking at the msw logs [13:37:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37663/console" [puppet] - 10https://gerrit.wikimedia.org/r/844983 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:37:41] !log UTC afternoon B&C window done [13:38:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 1239 [13:38:59] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:33] (03PS1) 10Hokwelum: Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 [13:40:14] (03CR) 10CI reject: [V: 04-1] Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (owner: 10Hokwelum) [13:40:25] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [13:40:44] (03PS1) 10Jelto: 
gitlab_runner: restart gitlab-runner gracefully [puppet] - 10https://gerrit.wikimedia.org/r/844985 [13:43:03] ACKNOWLEDGEMENT - Host an-test-worker1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Hosts down due to failure of rack C5 mgmt switch msw-c5-eqiad - The acknowledgement expires at: 2022-10-21 13:42:26. [13:43:03] ACKNOWLEDGEMENT - SSH on aqs1013.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Cathal Mooney Hosts down due to failure of rack C5 mgmt switch msw-c5-eqiad - The acknowledgement expires at: 2022-10-21 13:42:26. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:43:03] ACKNOWLEDGEMENT - Host aqs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Hosts down due to failure of rack C5 mgmt switch msw-c5-eqiad - The acknowledgement expires at: 2022-10-21 13:42:26. [13:43:03] ACKNOWLEDGEMENT - Host aqs1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Hosts down due to failure of rack C5 mgmt switch msw-c5-eqiad - The acknowledgement expires at: 2022-10-21 13:42:26. [13:43:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37664/console" [puppet] - 10https://gerrit.wikimedia.org/r/844985 (owner: 10Jelto) [13:44:55] back, anything I can help with and/or bounce ideas topranks ? [13:45:12] godog: thanks! no I think all is in order, I opened task https://phabricator.wikimedia.org/T321311 [13:45:20] pinged the dc-ops folks in that channel there [13:45:48] (03CR) 10Jelto: [V: 03+1] "I'm not 100% sure if it makes sense to add the signal in the systemd::service definition or if it should be added to the gitlab-runner.ser" [puppet] - 10https://gerrit.wikimedia.org/r/844985 (owner: 10Jelto) [13:45:48] ack! 
thanks, task LGTM [13:45:51] the Icinga apparent recoveries did confuse me, maybe you know why they came in, the switch was dead ever since so it wasn't a genuine recovery [13:46:19] for sure, I'll check those [13:46:42] yeah not really important, but a mystery for sure [13:49:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:49:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:49:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T321312)', diff saved to https://phabricator.wikimedia.org/P35620 and previous config saved to /var/cache/conftool/dbconfig/20221020-134937-ladsgroup.json [13:50:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET clusterinformations) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:47] yeah I'm definitely confused and so is icinga clearly, I can't ping e.g. 
ganeti1010.mgmt from alert1001 and icinga is thinking it is up at the moment [13:52:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8966 [13:53:09] (03PS2) 10Hokwelum: Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 [13:53:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8966 [13:53:18] godog: address 0.0.0.0 [13:53:44] in the config file [13:54:08] $ grep -c '0.0.0.0' objects/puppet_hosts.cfg [13:54:08] 13 [13:54:27] sigh, that explains, thanks volans [13:54:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2906 [13:55:24] it comes from 'ipaddress' in $facts['ipmi_lan'] [13:55:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (GET clusterinformations) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:58] and indeed facter -p has ipaddress => "0.0.0.0" [13:56:05] in the ipmi_lan fact [13:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321312)', diff saved to https://phabricator.wikimedia.org/P35621 and previous config saved to /var/cache/conftool/dbconfig/20221020-135605-ladsgroup.json [13:56:24] volans: thanks for solving that! 
[13:56:40] it comes from modules/ipmi/lib/facter/ipmi.rb [13:57:13] that comes from /usr/sbin/bmc-config [13:57:20] !log building production-images on build2001 - to build spark T318730 [13:57:21] in the case of ganeti1010, not sure about the others [13:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:25] T318730: Add spark and spark-operator images to operations/docker-images/production-images - https://phabricator.wikimedia.org/T318730 [13:57:50] I suspect it'll be the same thing yeah [13:58:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2906 [13:58:40] on one hand I'm happy that we're close to be moving the mgmt checks to prometheus and netbox-hiera, on the other I'm mildly concerned about the ipmi/facter/bmc-config rabbit hole [13:59:24] RECOVERY - Host ps1-c5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [13:59:34] RECOVERY - Host db1168.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [13:59:36] (03PS2) 10Ssingh: aptrepo: add thirdparty/haproxy24 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/844983 (https://phabricator.wikimedia.org/T321309) [13:59:50] RECOVERY - Host db1181.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [13:59:52] Ok the above are genuine recoveries [14:00:00] RECOVERY - Host db1146.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.02 ms [14:00:16] RECOVERY - Host aqs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [14:00:22] RECOVERY - Host aqs1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.81 ms [14:00:32] RECOVERY - Host dbproxy1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:00:36] RECOVERY - Host cloudcontrol1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.05 ms [14:00:47] (03CR) 10Btullis: [C: 03+2] Add a default value of undefined for the docker uid hash [puppet] - 10https://gerrit.wikimedia.org/r/844484 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) 
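Note on the 0.0.0.0 finding above: `grep -c '0.0.0.0' objects/puppet_hosts.cfg` only counts matching lines, it does not say which hosts are affected. A minimal sketch of listing the affected host objects follows, assuming the file uses the standard Icinga/Nagios `define host { ... }` object syntax. The function name, the sample config text, `example1001.mgmt` and `10.0.0.5` are hypothetical and are not taken from the log or from the real puppet_hosts.cfg.

```python
import re

# Hypothetical helper (not part of the log or of any Wikimedia tooling):
# list Icinga/Nagios host objects whose address is the 0.0.0.0 placeholder.
# "host\s*\{" deliberately does not match "define hostgroup {".
HOST_BLOCK = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL)


def hosts_with_placeholder_address(config_text, placeholder="0.0.0.0"):
    """Return host_name values of host blocks whose address equals placeholder."""
    flagged = []
    for block in HOST_BLOCK.findall(config_text):
        directives = {}
        for line in block.splitlines():
            # Drop trailing ';' comments and skip blank or '#' comment lines.
            line = line.split(";", 1)[0].strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        if directives.get("address") == placeholder:
            flagged.append(directives.get("host_name", "<unnamed>"))
    return flagged


# Illustrative sample only; the second host and its address are made up.
SAMPLE = """
define host {
    host_name   ganeti1010.mgmt
    address     0.0.0.0
}
define host {
    host_name   example1001.mgmt
    address     10.0.0.5
}
"""

if __name__ == "__main__":
    print(hosts_with_placeholder_address(SAMPLE))
```

Matching on the parsed `address` directive rather than on the raw string avoids counting lines where 0.0.0.0 appears in some other directive, which a plain grep would include.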
[14:00:58] (KubernetesAPILatency) resolved: (9) High Kubernetes API latency (GET clusterinformations) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:01:21] volans: happy to file a task for the ipmi_lan fact and friends unless you are doing it too ? [14:01:44] RECOVERY - Host parse1013.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 412.97 ms [14:02:02] RECOVERY - Host db1145.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [14:02:05] godog: no, I was not, go ahead, was still looking at it [14:02:08] RECOVERY - Host an-test-worker1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [14:02:18] RECOVERY - Host db1120.mgmt is UP: PING OK - Packet loss = 0%, RTA = 19.06 ms [14:02:18] RECOVERY - Host db1189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [14:02:21] and now bmc-config got back the config [14:02:39] volans: my guess is that when the idrac port is hard down that tool returns 0.0.0.0 as the IP [14:02:39] godog: but I can edit with the outputs of the commands once you've opened it [14:02:47] wild, ok doing so now [14:03:16] RECOVERY - Host parse1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [14:03:26] RECOVERY - Host puppetmaster1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:03:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:03:36] RECOVERY - Host es1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [14:03:36] RECOVERY - Host kubernetes1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [14:03:36] RECOVERY - Host gitlab-runner1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [14:03:38] RECOVERY - Host wdqs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [14:03:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:03:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:04:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:04:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:04:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T318950)', diff saved to https://phabricator.wikimedia.org/P35622 and previous config saved to /var/cache/conftool/dbconfig/20221020-140423-ladsgroup.json [14:04:28] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:04:46] 10SRE, 10Wikimedia-SVG-rendering, 10noc.wikimedia.org, 10serviceops-radar, 10Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) An update: The first docker images of Thumbor are around.... [14:04:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet [14:04:59] 10SRE, 10Infrastructure-Foundations: bmc-config (and thus ipmi_lan fact) returns 0.0.0.0 under certain conditions - https://phabricator.wikimedia.org/T321314 (10fgiunchedi) [14:05:06] ^ the task [14:05:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] "It should be updated I 'd say. Thumbor docker images are now available, I 've commented in the linked task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [14:05:50] RECOVERY - Host mw1482.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [14:05:50] RECOVERY - Host mw1483.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [14:05:50] RECOVERY - Host mw1484.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [14:05:50] RECOVERY - Host mw1485.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:05:50] RECOVERY - Host mw1486.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [14:05:50] RECOVERY - Host parse1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:06:00] RECOVERY - Host cloudmetrics1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:06:08] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10Jclark-ctr) msw-c5-eqiad unresponsive. utilized previous decom switch to bring management connection back online. netbox updated [14:06:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318950)', diff saved to https://phabricator.wikimedia.org/P35623 and previous config saved to /var/cache/conftool/dbconfig/20221020-140633-ladsgroup.json [14:08:46] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host etherpad1003.eqiad.wmnet [14:09:00] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host etherpad1003.eqiad.wmnet [14:09:18] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10cmooney) 05Open→03Resolved a:03cmooney Awesome @Jclark-ctr thanks for the speedy response! I can confirm port is back up: ` Oct 20 13:58:27 msw1-eqiad mib2d[2003]: SNMP_TRAP_LINK_UP: ifIndex 551, ifAdminStatus up(1)... 
[14:09:54] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host etherpad1003.eqiad.wmnet [14:11:09] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet [14:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P35624 and previous config saved to /var/cache/conftool/dbconfig/20221020-141112-ladsgroup.json [14:13:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1003.eqiad.wmnet [14:14:15] (03CR) 10Jelto: [C: 03+1] "I like to also check the ref_protected claim." [puppet] - 10https://gerrit.wikimedia.org/r/844513 (owner: 10Dduvall) [14:14:49] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Jclark-ctr) Replaced Failed drive [14:14:55] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Jclark-ctr) 05Open→03Resolved [14:16:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [14:17:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:17:26] (03CR) 10Hashar: [C: 04-1] scap: automatize plugins handling (032 comments) [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [14:17:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:17:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P35625 and previous config saved to 
/var/cache/conftool/dbconfig/20221020-141736-ladsgroup.json [14:17:42] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:19:51] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubernetes mediawiki config: Remove nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844459 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [14:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P35626 and previous config saved to /var/cache/conftool/dbconfig/20221020-141956-ladsgroup.json [14:20:05] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) Thank you, jclark! Waiting now for the disk to be fully rebuilt. [14:20:17] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Ladsgroup) Thanks! 
[14:20:34] RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [14:20:41] (03PS3) 10Hokwelum: Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (https://phabricator.wikimedia.org/T319269) [14:20:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:21:09] (03CR) 10JMeybohm: [C: 03+2] Provide the cluster_cidr to kube-proxy in wikikube codfw [puppet] - 10https://gerrit.wikimedia.org/r/844449 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [14:21:11] (03PS9) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [14:21:13] (03CR) 10JMeybohm: [C: 03+2] Provide the cluster_cidr to kube-proxy in wikikube eqiad [puppet] - 10https://gerrit.wikimedia.org/r/844450 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [14:21:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P35627 and previous config saved to /var/cache/conftool/dbconfig/20221020-142139-ladsgroup.json [14:22:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:46] 10SRE, 10Infrastructure-Foundations: bmc-config (and thus ipmi_lan fact) returns 0.0.0.0 under certain conditions - https://phabricator.wikimedia.org/T321314 (10Volans) The icinga config for the mgmt hosts is generated by `modules/monitoring/manifests/host.pp` that has: ` address => $facts['ipmi_... 
[14:22:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [14:22:56] 10SRE, 10Infrastructure-Foundations: bmc-config (and thus ipmi_lan fact) returns 0.0.0.0 under certain conditions - https://phabricator.wikimedia.org/T321314 (10Volans) p:05Triage→03Medium [14:23:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [14:23:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:23:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:23:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T321312)', diff saved to https://phabricator.wikimedia.org/P35628 and previous config saved to /var/cache/conftool/dbconfig/20221020-142331-ladsgroup.json [14:23:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.484 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:04] 10SRE-tools, 10Icinga, 10Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998 (10jcrespo) FYI, For the rebuilt after a disk change, the utility is working: ` sudo /usr/local/lib/nagios/plugins/get-raid-status-p... 
[14:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P35629 and previous config saved to /var/cache/conftool/dbconfig/20221020-142618-ladsgroup.json [14:29:23] (03PS1) 10Clément Goubert: kubernetes mediawiki config: Cleanup nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/844991 (https://phabricator.wikimedia.org/T321042) [14:30:01] (03CR) 10Hashar: [C: 04-1] "That is loosely based on what we did for Phabricator which has scap checks scripts managed by Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/844523 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [14:31:41] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37665/console" [puppet] - 10https://gerrit.wikimedia.org/r/844991 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [14:31:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321312)', diff saved to https://phabricator.wikimedia.org/P35630 and previous config saved to /var/cache/conftool/dbconfig/20221020-143148-ladsgroup.json [14:33:35] (03CR) 10JMeybohm: [C: 04-1] helmfile.d: add thumbor configuration (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:34:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:35:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P35631 and previous config saved to /var/cache/conftool/dbconfig/20221020-143502-ladsgroup.json [14:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', 
diff saved to https://phabricator.wikimedia.org/P35632 and previous config saved to /var/cache/conftool/dbconfig/20221020-143646-ladsgroup.json [14:37:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36012 [14:37:41] !log powerdown wdqs2005 for maintenance [14:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36012 [14:38:19] XioNoX: ok [14:39:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:40:03] PROBLEM - Host wdqs2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321312)', diff saved to https://phabricator.wikimedia.org/P35633 and previous config saved to /var/cache/conftool/dbconfig/20221020-144125-ladsgroup.json [14:41:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:41:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T321312)', diff saved to https://phabricator.wikimedia.org/P35634 and previous config saved to /var/cache/conftool/dbconfig/20221020-144150-ladsgroup.json [14:46:25] (03CR) 10Jbond: [C: 03+2] admin: jbond user files [puppet] - 10https://gerrit.wikimedia.org/r/844970 (owner: 10Jbond) [14:46:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1121', diff saved to https://phabricator.wikimedia.org/P35635 and previous config saved to /var/cache/conftool/dbconfig/20221020-144655-ladsgroup.json [14:47:17] RECOVERY - Host wdqs2005 is UP: PING OK - Packet loss = 0%, RTA = 32.07 ms [14:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321312)', diff saved to https://phabricator.wikimedia.org/P35636 and previous config saved to /var/cache/conftool/dbconfig/20221020-144823-ladsgroup.json [14:49:16] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: sudo rules for scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/844523 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [14:49:23] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2005 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P35637 and previous config saved to /var/cache/conftool/dbconfig/20221020-145009-ladsgroup.json [14:50:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2005'] [14:50:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2005'] [14:50:59] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:51:18] ^^ acked alerts for wdqs2005 [14:51:44] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [14:51:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318950)', diff saved to https://phabricator.wikimedia.org/P35638 and previous config saved to 
/var/cache/conftool/dbconfig/20221020-145152-ladsgroup.json [14:51:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:51:58] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:52:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:52:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35639 and previous config saved to /var/cache/conftool/dbconfig/20221020-145214-ladsgroup.json [14:52:57] RECOVERY - IPMI Sensor Status on es2021 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:53:25] (03CR) 10Andrew Bogott: [C: 03+2] Keep Nova API public in eqiad1 but restrict in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/844506 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:54:49] 10SRE, 10ops-eqiad: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T321315 (10ops-monitoring-bot) [14:54:55] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) [14:55:02] 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10Papaul) 05Open→03Resolved Replaced both power cords and upgrade IDRAC. 
System is back online [14:55:31] (03CR) 10Giuseppe Lavagetto: Simplify management of the request time limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 (owner: 10Giuseppe Lavagetto) [14:55:52] (03PS3) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [14:55:53] (03PS6) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [14:56:34] jinxer-wm: nowandnext [14:56:38] jouncebot: nowandnext [14:56:38] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [14:56:38] In 1 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1600) [14:56:47] (03CR) 10CI reject: [V: 04-1] Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto) [14:57:02] (03CR) 10CI reject: [V: 04-1] Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 (owner: 10Giuseppe Lavagetto) [14:59:35] (03PS4) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [14:59:37] (03PS7) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [14:59:58] (03PS1) 10Urbanecm: GrowthExperiments: Reorder variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844994 [15:00:00] (03PS1) 10Urbanecm: GrowthExperiments: Define wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844995 (https://phabricator.wikimedia.org/T318457) [15:00:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844994 (owner: 10Urbanecm) [15:00:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 
using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844995 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [15:01:15] (03Merged) 10jenkins-bot: GrowthExperiments: Reorder variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844994 (owner: 10Urbanecm) [15:01:20] (03Merged) 10jenkins-bot: GrowthExperiments: Define wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844995 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [15:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35640 and previous config saved to /var/cache/conftool/dbconfig/20221020-150125-ladsgroup.json [15:01:30] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:01:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:844994|GrowthExperiments: Reorder variables]], [[gerrit:844995|GrowthExperiments: Define wgGEMentorshipUseIsActiveFlag (T318457)]] [15:01:38] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [15:01:53] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:844994|GrowthExperiments: Reorder variables]], [[gerrit:844995|GrowthExperiments: Define wgGEMentorshipUseIsActiveFlag (T318457)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [15:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P35641 and previous config saved to /var/cache/conftool/dbconfig/20221020-150201-ladsgroup.json [15:03:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P35642 and previous config saved to 
/var/cache/conftool/dbconfig/20221020-150329-ladsgroup.json [15:05:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P35643 and previous config saved to /var/cache/conftool/dbconfig/20221020-150515-ladsgroup.json [15:05:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:05:21] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:05:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P35644 and previous config saved to /var/cache/conftool/dbconfig/20221020-150537-ladsgroup.json [15:06:04] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:844994|GrowthExperiments: Reorder variables]], [[gerrit:844995|GrowthExperiments: Define wgGEMentorshipUseIsActiveFlag (T318457)]] (duration: 04m 30s) [15:06:31] (03PS1) 10Btullis: Correct missing comma in proxy options [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844997 (https://phabricator.wikimedia.org/T318730) [15:06:37] (03PS1) 10Hashar: gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) [15:06:49] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:06:56] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) a:05thcipriani→03None [15:07:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/844998 
(https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [15:07:15] * urbanecm done [15:07:20] 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10LSobanski) @MPhamWMF is anything needed from SRE here or can I remove the tag? [15:07:21] (03CR) 10JMeybohm: Remove references to deprecated kubeyaml (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [15:08:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet [15:09:22] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [15:09:34] 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration, 10serviceops: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10LSobanski) [15:10:28] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T321254 (10Papaul) 05Open→03Resolved a:03Papaul I just added the interface in netbox and disable it. ` [edit interfaces] + xe-7/0/9 { + description DISABLED; + disable; + } ` [15:11:04] 10SRE, 10ops-codfw, 10decommission-hardware, 10Discovery-Search (Current work): decommission elastic20[25-36].codfw.wmnet - https://phabricator.wikimedia.org/T321243 (10Papaul) a:03Papaul [15:11:11] 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10jcrespo) Thank you, Papaul- that seems to have fixed it. 
` 0d 0h 17m 5s 1/3 Sensor Type(s) Temperature, Power_Supply Status: OK ` [15:12:09] PROBLEM - Check systemd state on idp-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9304 [15:14:58] (03CR) 10Giuseppe Lavagetto: New organization of templates (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [15:15:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet [15:15:17] (03PS1) 10Filippo Giunchedi: Add 'pybal_server_pooled' metric [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) [15:15:55] (03CR) 10CI reject: [V: 04-1] Add 'pybal_server_pooled' metric [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [15:16:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9304 [15:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P35645 and previous config saved to /var/cache/conftool/dbconfig/20221020-151631-ladsgroup.json [15:17:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321312)', diff saved to https://phabricator.wikimedia.org/P35646 and previous config saved to /var/cache/conftool/dbconfig/20221020-151708-ladsgroup.json [15:17:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [15:17:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [15:17:32] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T321312)', diff saved to https://phabricator.wikimedia.org/P35647 and previous config saved to /var/cache/conftool/dbconfig/20221020-151731-ladsgroup.json [15:18:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P35648 and previous config saved to /var/cache/conftool/dbconfig/20221020-151836-ladsgroup.json [15:19:38] (03CR) 10Filippo Giunchedi: "Making lvsservice.removeServer have the side effect of setting a server as depooled made unrelated tests fail:" [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [15:21:09] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host build2001.codfw.wmnet [15:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T321312)', diff saved to https://phabricator.wikimedia.org/P35649 and previous config saved to /var/cache/conftool/dbconfig/20221020-152249-ladsgroup.json [15:23:54] !log people2002, people1003 - rebooting one by one [15:24:41] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [15:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P35650 and previous config saved to /var/cache/conftool/dbconfig/20221020-152503-ladsgroup.json [15:25:58] (03CR) 10Hashar: "I have amended the commit message with some clean up instructions extracted from the PPC output https://puppet-compiler.wmflabs.org/pcc-wo" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [15:26:12] (03PS6) 10Hashar: gerrit: sudo rules for scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/844523 (https://phabricator.wikimedia.org/T317412) 
[15:26:31] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:27:10] !log etherpad - rebooting [15:28:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:36] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [15:30:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P35651 and previous config saved to /var/cache/conftool/dbconfig/20221020-153138-ladsgroup.json [15:32:00] (03CR) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [15:32:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2152 [15:32:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2152 [15:33:00] (03Abandoned) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [15:33:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321312)', diff saved to https://phabricator.wikimedia.org/P35652 and previous config saved to /var/cache/conftool/dbconfig/20221020-153343-ladsgroup.json [15:33:47] (03CR) 10Btullis: [V: 03+2 C: 03+2] Correct missing comma in proxy options [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/844997 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [15:33:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:34:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:34:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:34:52] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [15:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:19] (03CR) 10Dzahn: [C: 03+1] "SIGQUIT makes sense when reading the linked docs, ACK +1. I would have probably added it in the template but either should work. 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/844985 (owner: 10Jelto) [15:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P35653 and previous config saved to /var/cache/conftool/dbconfig/20221020-153755-ladsgroup.json [15:39:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:40:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35654 and previous config saved to /var/cache/conftool/dbconfig/20221020-154006-ladsgroup.json [15:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P35655 and previous config saved to /var/cache/conftool/dbconfig/20221020-154016-ladsgroup.json [15:41:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:42:09] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:44:13] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:44:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [15:44:46] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [15:46:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35656 and previous config saved to /var/cache/conftool/dbconfig/20221020-154635-ladsgroup.json [15:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35657 and previous config saved to /var/cache/conftool/dbconfig/20221020-154644-ladsgroup.json [15:46:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:46:50] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:47:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:47:05] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:47:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:47:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T318950)', diff saved to https://phabricator.wikimedia.org/P35658 and previous config saved to /var/cache/conftool/dbconfig/20221020-154724-ladsgroup.json [15:48:16] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) [15:49:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [15:49:36] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) > Mostly, don't worry about sudo, I can create it for you. Just noting that it will be "whatever-name-ale... [15:49:55] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [15:50:12] 10SRE, 10Znuny, 10serviceops-collab: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Dzahn) @Arnoldokoth @jbond Looks to me like what has happened is that a new password has been added in `hieradata/common/profile/vrts.yaml` but there is still an old passwo... 
[15:50:20] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) 05Resolved→03Open [15:50:46] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Dzahn) Thank you very much for doing this, @BTullis [15:52:04] (03PS1) 10Urbanecm: MenteeOverview: Fix link under "reverted" column [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845009 (https://phabricator.wikimedia.org/T321321) [15:52:26] jouncebot: nowandnext [15:52:27] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [15:52:27] In 0 hour(s) and 7 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1600) [15:52:31] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @BTullis Done! Please check your mail. ` [lists1001:~] $ sudo mailman-wrapper create --owner btullis@wik... 
[15:52:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845009 (https://phabricator.wikimedia.org/T321321) (owner: 10Urbanecm) [15:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P35659 and previous config saved to /var/cache/conftool/dbconfig/20221020-155302-ladsgroup.json [15:55:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P35660 and previous config saved to /var/cache/conftool/dbconfig/20221020-155523-ladsgroup.json [15:55:58] !log phabricator (diffusion) - clicked "disable" and then "deactivate" on Blubber diffusion repo. it's now "inactive, publishing and syncing has been disabled https://phabricator.wikimedia.org/source/blubber/ T317820 [15:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:02] T317820: Archive the Blubber gerrit repo - https://phabricator.wikimedia.org/T317820 [15:59:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [15:59:59] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:35] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [16:00:51] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host failoid2002.codfw.wmnet [16:01:35] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [16:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P35661 and previous config saved to /var/cache/conftool/dbconfig/20221020-160142-ladsgroup.json [16:02:46] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [16:04:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:05:59] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [16:08:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T321312)', diff saved to https://phabricator.wikimedia.org/P35662 and previous config saved to /var/cache/conftool/dbconfig/20221020-160808-ladsgroup.json [16:08:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [16:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [16:08:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [16:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T321312)', diff saved to https://phabricator.wikimedia.org/P35663 and previous config saved to 
/var/cache/conftool/dbconfig/20221020-160832-ladsgroup.json [16:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P35664 and previous config saved to /var/cache/conftool/dbconfig/20221020-161029-ladsgroup.json [16:10:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:10:35] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:10:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:10:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:11:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:11:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P35665 and previous config saved to /var/cache/conftool/dbconfig/20221020-161106-ladsgroup.json [16:11:51] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [16:12:18] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [16:12:47] (03Merged) 10jenkins-bot: MenteeOverview: Fix link under "reverted" column [extensions/GrowthExperiments] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845009 (https://phabricator.wikimedia.org/T321321) (owner: 10Urbanecm) [16:13:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:845009|MenteeOverview: Fix link under "reverted" column (T321321)]] [16:13:08] T321321: MenteeOverview(Vue): Link under "Number of reverted edits" uses the mentorship-questions tag - 
https://phabricator.wikimedia.org/T321321 [16:13:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P35666 and previous config saved to /var/cache/conftool/dbconfig/20221020-161326-ladsgroup.json [16:13:35] (03PS11) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [16:13:45] 10SRE, 10Infrastructure-Foundations, 10netops: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi) 05Open→03Resolved a:03ayounsi Mass emailing is done: * SV8 peers that are only in SV8 -> Told them they can delete the SV8 sessions * SV8 peers that are also in other IXPs but all sess... [16:14:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:15:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321312)', diff saved to https://phabricator.wikimedia.org/P35667 and previous config saved to /var/cache/conftool/dbconfig/20221020-161502-ladsgroup.json [16:15:56] (03CR) 10Dzahn: [C: 03+2] remove git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/831627 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [16:16:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P35668 and previous config saved to /var/cache/conftool/dbconfig/20221020-161648-ladsgroup.json [16:17:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:845009|MenteeOverview: Fix link under "reverted" column (T321321)]] (duration: 04m 22s) [16:18:03] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
prometheus1006.eqiad.wmnet [16:18:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36351 [16:19:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:20:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36351 [16:20:55] !log phab1001 (phabricator) - remove LVS IP from loopback - ip addr del 208.80.154.250 dev lo - T296022 [16:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [16:22:00] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [16:22:02] 10SRE, 10ops-codfw, 10decommission-hardware, 10Discovery-Search (Current work): decommission elastic20[25-36].codfw.wmnet - https://phabricator.wikimedia.org/T321243 (10Papaul) [16:22:31] !log phab1001 (phabricator) - remove LVS IP from loopback - ip addr del 2620:0:861:ed1a::3:16 dev lo - T296022 [16:24:14] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [16:24:36] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318950)', diff saved to https://phabricator.wikimedia.org/P35669 and previous config saved to /var/cache/conftool/dbconfig/20221020-162442-ladsgroup.json [16:26:10] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [16:26:12] !log phab2001 (phabricator) - remove LVS IPs from loopback - ip addr del 208.80.153.250 dev lo; ip addr del 2620:0:860:ed1a::3:fa dev lo - T296022 [16:26:18] !log building production-images on build2 for spark (second attempt) [16:26:40] !log correction: build2001 [16:27:17] the log bot is out but we did not notice because the confirmation log line has been removed [16:27:33] which was basically the concern when that happened [16:27:47] PROBLEM - SSH on db1121.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:28:21] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet [16:28:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P35670 and previous config saved to /var/cache/conftool/dbconfig/20221020-162833-ladsgroup.json [16:29:20] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Dzahn) The very last step would then be to remove the line from the puppetized exim aliases in the private repo. 
[16:30:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P35671 and previous config saved to /var/cache/conftool/dbconfig/20221020-163008-ladsgroup.json [16:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35672 and previous config saved to /var/cache/conftool/dbconfig/20221020-163155-ladsgroup.json [16:32:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:32:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:32:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:32:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T321312)', diff saved to https://phabricator.wikimedia.org/P35673 and previous config saved to /var/cache/conftool/dbconfig/20221020-163236-ladsgroup.json [16:32:49] !log netbox - set IPs: 208.80.153.250, 208.80.154.250, 2620:0:860:ed1a::3:fa, 2620:0:861:ed1a::3:16 from active to 'deprecated' git-ssh - https://netbox.wikimedia.org/search/?q=git-ssh&obj_type= - T296022 [16:33:02] (03PS6) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [16:33:20] (03CR) 10Bking: Mount labstore to wcqs/wdqs instance for dumps reload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [16:33:45] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [16:34:35] XioNoX, volans: I am giving back public IPs. by setting them to "deprecated" in netbox. you like that, right [16:35:19] mutante: what do you mean by deprecated? [16:35:31] not active anymore [16:35:33] are they assigned to any host? [16:35:45] removed from loopback on the hosts that formerly used them [16:35:49] role: VIP [16:35:57] so they are not pingable, and not used, correct? 
[16:37:01] volans: that's right [16:37:15] formerly used by service behind LVS, now shut down [16:37:23] then they should simply be deleted from Netbox, you can mass-delete them from https://netbox.wikimedia.org/ipam/ip-addresses/?q=git-ssh and then run the sre.dns.netbox cookbook [16:37:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:37:36] why did I know it would be wrong :P [16:37:39] ok [16:38:27] was hoping that is what I don't have to do when I deprecate them [16:39:00] (03PS1) 10Andrew Bogott: openstack: make domain-aware [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 [16:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321312)', diff saved to https://phabricator.wikimedia.org/P35674 and previous config saved to /var/cache/conftool/dbconfig/20221020-163900-ladsgroup.json [16:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P35675 and previous config saved to /var/cache/conftool/dbconfig/20221020-163950-ladsgroup.json [16:40:09] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [16:42:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:43:26] (03CR) 10Btullis: "Fillippo, could you possibly change all of the instances of:" [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [16:43:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P35676 and previous config saved to /var/cache/conftool/dbconfig/20221020-164339-ladsgroup.json [16:45:05] (03CR) 10Btullis: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [16:45:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P35677 and previous config saved to /var/cache/conftool/dbconfig/20221020-164515-ladsgroup.json [16:48:10] (03CR) 10CI reject: [V: 04-1] openstack: make domain-aware [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (owner: 10Andrew Bogott) [16:48:28] (03CR) 10Volans: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (owner: 10Andrew Bogott) [16:54:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P35678 and previous config saved to /var/cache/conftool/dbconfig/20221020-165406-ladsgroup.json [16:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P35679 and previous config saved to /var/cache/conftool/dbconfig/20221020-165456-ladsgroup.json [16:58:35] !log herron@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [16:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P35680 and previous config saved to /var/cache/conftool/dbconfig/20221020-165846-ladsgroup.json [16:58:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:59:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:59:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35681 and previous config saved to /var/cache/conftool/dbconfig/20221020-165907-ladsgroup.json [17:00:05] bd808: Your horoscope predicts another unfortunate Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1700). 
[17:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321312)', diff saved to https://phabricator.wikimedia.org/P35682 and previous config saved to /var/cache/conftool/dbconfig/20221020-170021-ladsgroup.json [17:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [17:00:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [17:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T321312)', diff saved to https://phabricator.wikimedia.org/P35683 and previous config saved to /var/cache/conftool/dbconfig/20221020-170056-ladsgroup.json [17:06:15] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [17:07:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321312)', diff saved to https://phabricator.wikimedia.org/P35684 and previous config saved to /var/cache/conftool/dbconfig/20221020-170724-ladsgroup.json [17:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P35685 and previous config saved to /var/cache/conftool/dbconfig/20221020-170913-ladsgroup.json [17:10:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318950)', diff saved to https://phabricator.wikimedia.org/P35686 and previous config saved to /var/cache/conftool/dbconfig/20221020-171003-ladsgroup.json [17:10:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:10:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:10:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1191 (T318950)', diff saved to https://phabricator.wikimedia.org/P35687 and previous config saved to /var/cache/conftool/dbconfig/20221020-171024-ladsgroup.json [17:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318950)', diff saved to https://phabricator.wikimedia.org/P35688 and previous config saved to /var/cache/conftool/dbconfig/20221020-171234-ladsgroup.json [17:12:40] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [17:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35689 and previous config saved to /var/cache/conftool/dbconfig/20221020-171439-ladsgroup.json [17:14:44] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:15:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 10310 [17:17:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 10310 [17:18:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P35690 and previous config saved to /var/cache/conftool/dbconfig/20221020-172231-ladsgroup.json [17:23:04] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [17:24:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321312)', diff saved to https://phabricator.wikimedia.org/P35691 and previous config saved to /var/cache/conftool/dbconfig/20221020-172419-ladsgroup.json [17:24:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:24:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35692 and previous config saved to /var/cache/conftool/dbconfig/20221020-172445-ladsgroup.json [17:26:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P35693 and previous config saved to /var/cache/conftool/dbconfig/20221020-172741-ladsgroup.json [17:30:31] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35695 and previous config saved to /var/cache/conftool/dbconfig/20221020-173111-ladsgroup.json [17:35:35] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [17:36:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P35696 and previous config saved to /var/cache/conftool/dbconfig/20221020-173737-ladsgroup.json [17:42:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P35697 and previous config saved to /var/cache/conftool/dbconfig/20221020-174248-ladsgroup.json [17:42:57] hihi, what's the chances of being able to do an early backport to wmf.6? 
I rather daftly only backported https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/845012 to wmf.5 and only got about a day's worth of data from enwiki. It's not long to wait until the UTC late window, but more time deployed = more data = more better. [17:44:08] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P35698 and previous config saved to /var/cache/conftool/dbconfig/20221020-174453-ladsgroup.json [17:46:13] !log phabricator - disabling git-ssh URIs for repo 'phabricator-translations' https://phabricator.wikimedia.org/source/phabricator-translation - T296022 [17:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P35699 and previous config saved to /var/cache/conftool/dbconfig/20221020-174617-ladsgroup.json [17:46:18] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [17:48:22] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:51] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:51:59] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321312)', diff saved to https://phabricator.wikimedia.org/P35700 and previous config saved to /var/cache/conftool/dbconfig/20221020-175244-ladsgroup.json [17:52:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [17:53:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [17:53:18] !log robh@cumin2002 START - 
Cookbook sre.hosts.provision for host cp4037.mgmt.ulsfo.wmnet with reboot policy FORCED [17:54:33] jouncebot: nowandnext [17:54:33] For the next 0 hour(s) and 5 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1700) [17:54:33] In 0 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1800) [17:55:02] TheresNoTime: ask the train people I guess, since that's starting in just a bit [17:55:57] i'm not sure what the official policy stance is, but it's relatively common for people with access to deploy their own patches outside the official windows, as long as it's not on a friday/weekend and there's nothing else going on [17:56:45] Good idea. hashar and/or dduvall — can I deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/845012 (a backport to wmf.6) [17:57:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:57:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35701 and previous config saved to /var/cache/conftool/dbconfig/20221020-175726-ladsgroup.json [17:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318950)', diff saved to https://phabricator.wikimedia.org/P35702 and previous config saved to /var/cache/conftool/dbconfig/20221020-175755-ladsgroup.json [17:57:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:58:01] T318950: Fix renamed indexes of flaggedrevs table in production - 
https://phabricator.wikimedia.org/T318950 [17:58:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:58:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T318950)', diff saved to https://phabricator.wikimedia.org/P35703 and previous config saved to /var/cache/conftool/dbconfig/20221020-175817-ladsgroup.json [17:58:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35704 and previous config saved to /var/cache/conftool/dbconfig/20221020-175854-ladsgroup.json [18:00:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35705 and previous config saved to /var/cache/conftool/dbconfig/20221020-175959-ladsgroup.json [18:00:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:00:04] hashar and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T1800). 
[18:00:04] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:00:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:00:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P35706 and previous config saved to /var/cache/conftool/dbconfig/20221020-180021-ladsgroup.json [18:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318950)', diff saved to https://phabricator.wikimedia.org/P35707 and previous config saved to /var/cache/conftool/dbconfig/20221020-180027-ladsgroup.json [18:01:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P35708 and previous config saved to /var/cache/conftool/dbconfig/20221020-180123-ladsgroup.json [18:02:18] is the train window even happening? everything is on wmf.6, no? 
[18:02:46] TheresNoTime: i believe hashar deployed during the EU window, so feel free to deploy now [18:02:58] dduvall: thank you :) [18:03:08] np [18:04:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:05:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35709 and previous config saved to /var/cache/conftool/dbconfig/20221020-180502-ladsgroup.json [18:05:53] !log Backporting [[gerrit:845012]] for T310974 to wmf.6 [18:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:58] T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [18:06:22] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4037.mgmt.ulsfo.wmnet with reboot policy FORCED [18:06:49] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4037 [18:06:52] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4037 [18:09:55] !log samtar@deploy1002 Started scap: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]] [18:10:15] !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [18:10:22] * TheresNoTime is testing [18:11:14] yay tnt [18:15:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P35710 and previous config saved to /var/cache/conftool/dbconfig/20221020-181520-ladsgroup.json [18:15:26] T318955: Drop fr_comment and fr_text from flaggedrevs 
table in production - https://phabricator.wikimedia.org/T318955 [18:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P35711 and previous config saved to /var/cache/conftool/dbconfig/20221020-181533-ladsgroup.json [18:15:40] * TheresNoTime syncin' [18:16:30] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35712 and previous config saved to /var/cache/conftool/dbconfig/20221020-181630-ladsgroup.json [18:18:04] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:845012|Hooks: Log to statsd when a page is noindex'd (T310974)]] (duration: 08m 08s) [18:18:08] T310974: Extend PageTriageMaxAge for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [18:18:25] Done, thanks :) [18:18:26] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:20:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P35713 and previous config saved to /var/cache/conftool/dbconfig/20221020-182008-ladsgroup.json [18:21:33] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:25:36] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:26:03] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:28:06] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:30:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to 
https://phabricator.wikimedia.org/P35714 and previous config saved to /var/cache/conftool/dbconfig/20221020-183026-ladsgroup.json [18:30:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P35715 and previous config saved to /var/cache/conftool/dbconfig/20221020-183040-ladsgroup.json [18:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P35716 and previous config saved to /var/cache/conftool/dbconfig/20221020-183515-ladsgroup.json [18:36:53] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:37:29] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:39:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:42:09] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P35717 and previous config saved to /var/cache/conftool/dbconfig/20221020-184533-ladsgroup.json [18:45:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318950)', diff saved to https://phabricator.wikimedia.org/P35718 and previous config saved to /var/cache/conftool/dbconfig/20221020-184547-ladsgroup.json [18:45:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1003.eqiad.wmnet with reason: Maintenance [18:45:52] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [18:46:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:46:54] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [18:49:24] !log robh@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4029.ulsfo.wmnet [18:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35719 and previous config saved to /var/cache/conftool/dbconfig/20221020-185021-ladsgroup.json [18:52:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35720 and previous config saved to /var/cache/conftool/dbconfig/20221020-185236-ladsgroup.json [18:53:03] hi TheresNoTime, I wonder if those patches you synced means I could put https://gerrit.wikimedia.org/r/808424 on the deploy calendar? 
[18:53:23] or is there still any amendment needed [18:55:35] koi: well 815835 needs to ride the train, so maybe wait to schedule that for a little while [18:56:34] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:58:51] got it, thanks for reminding [19:00:34] hmm what happened to wikibugs [19:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P35721 and previous config saved to /var/cache/conftool/dbconfig/20221020-190039-ladsgroup.json [19:00:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:00:45] * sukhe will read the docs later to restart it [19:00:45] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [19:00:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35722 and previous config saved to /var/cache/conftool/dbconfig/20221020-190101-ladsgroup.json [19:02:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [19:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35723 and previous config saved to /var/cache/conftool/dbconfig/20221020-190743-ladsgroup.json [19:12:47] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS buster [19:13:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [19:13:25] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:13:26] !log robh@cumin2002 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts cp4029.ulsfo.wmnet [19:13:48] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:15:55] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:02] hm, my restart of wikibugs was more of a... kill [19:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35724 and previous config saved to /var/cache/conftool/dbconfig/20221020-191624-ladsgroup.json [19:16:29] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [19:16:47] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [19:17:00] TheresNoTime: reminder, wikibugs only joins a channel when it has a message to send there [19:17:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:18:56] taavi: I think something else is up :/ T321342 [19:18:57] T321342: wikibugs losing connection to IRC - https://phabricator.wikimedia.org/T321342 [19:20:03] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:20:18] (ProbeDown) firing: Service thanos-swift:443 has failed probes (http_thanos-swift_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-swift:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:22:31] ^ herron [19:22:34] IIRC you were working on this? [19:22:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P35725 and previous config saved to /var/cache/conftool/dbconfig/20221020-192249-ladsgroup.json [19:22:55] sukhe: yes that's me [19:23:09] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:23:18] thanks, was just checking [19:23:20] huh I thought reboot single depooled the host but I guess not [19:23:28] it happens :P [19:23:36] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [19:25:13] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:25:18] (ProbeDown) resolved: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:28] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [19:27:52] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS buster [19:28:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [19:29:43] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [19:30:15] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [19:30:33] (ProbeDown) firing: (5) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:35] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:31:17] herron: https://github.com/wikimedia/operations-cookbooks/blob/f1535c82ec28f3e1f30fa4ad660c2cd8353edbdf/cookbooks/sre/hosts/reboot-single.py#L23 [19:31:23] doesn't seem like this supports depooling [19:31:26] so you have to do that manually [19:31:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling
after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P35726 and previous config saved to /var/cache/conftool/dbconfig/20221020-193130-ladsgroup.json [19:31:33] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1437 days) https://wikitech.wikimedia.org/wiki/Logs [19:31:37] (yet) lol [19:32:00] thanks sukhe TIL, thinking of patching that tbh [19:32:06] :D [19:33:37] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [19:35:33] (ProbeDown) resolved: (5) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:43] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 149 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 153, active_shards: 153, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 149, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.66225165562914 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:35:47] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0
https://wikitech.wikimedia.org/wiki/Search%23Administration [19:37:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS buster [19:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35727 and previous config saved to /var/cache/conftool/dbconfig/20221020-193756-ladsgroup.json [19:38:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:38:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:39:25] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [19:41:56] !log dzahn@cumin2002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:codfw and (A:gitlab-runner) [19:42:24] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [19:42:32] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [19:42:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:42:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:42:54] !log rebooting gitlab-runners in codfw [19:42:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35728 and previous config saved to /var/cache/conftool/dbconfig/20221020-194257-ladsgroup.json [19:43:55] RECOVERY - 
ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 153, active_shards: 306, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:44:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35729 and previous config saved to /var/cache/conftool/dbconfig/20221020-194425-ladsgroup.json [19:46:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P35730 and previous config saved to /var/cache/conftool/dbconfig/20221020-194637-ladsgroup.json [19:48:26] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [19:50:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [19:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35731 and previous config saved to /var/cache/conftool/dbconfig/20221020-195331-ladsgroup.json [19:57:01] (PS1) Hoo man: Only generate QS maxlag for pooled servers [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) [19:57:18] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [19:57:22] !log robh@cumin2002 START - Cookbook sre.dns.netbox [19:58:30] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install
cp40[37-52] - https://phabricator.wikimedia.org/T317244 (RobH) [19:59:17] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [20:00:00] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4047.mgmt.ulsfo.wmnet with reboot policy FORCED [20:00:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [20:00:05] brennen and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221020T2000). [20:00:05] Jdlrobson, TheresNoTime, MatmaRex, and hoo: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] sukhe: well how about that, support is there just defaulted to false https://github.com/wikimedia/operations-cookbooks/blob/f1535c82ec28f3e1f30fa4ad660c2cd8353edbdf/cookbooks/sre/hosts/reboot-single.py#L44-L45 [20:00:20] * TheresNoTime can deploy! [20:00:26] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4049.mgmt.ulsfo.wmnet with reboot policy FORCED [20:01:00] present [20:01:01] TheresNoTime: can I steal something for training again? :) [20:01:25] thcipriani: sure :) want to do the first in the list, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/843995 ?
[20:01:29] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:33] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [20:01:33] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:01:36] hi [20:01:39] TheresNoTime: absolutely thank you! 
[20:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P35732 and previous config saved to /var/cache/conftool/dbconfig/20221020-200143-ladsgroup.json [20:01:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:01:50] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [20:01:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:codfw and (A:gitlab-runner) [20:01:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:02:03] thcipriani: I'll wait to hear :) Jdlrobson, thcipriani is deploying your patch [20:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P35733 and previous config saved to /var/cache/conftool/dbconfig/20221020-200205-ladsgroup.json [20:02:30] o/ [20:02:36] kindrobot: o/ [20:03:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [20:03:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [20:03:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T321312)', diff saved to https://phabricator.wikimedia.org/P35734 and previous config saved to /var/cache/conftool/dbconfig/20221020-200321-ladsgroup.json [20:05:01] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [20:05:28] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS buster [20:07:12] (03CR) 10TrainBranchBot: [C: 
03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843995 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:07:45] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [20:08:01] (03Merged) 10jenkins-bot: Updates to Wikipedia wordmark/taglines and project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843995 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:08:14] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:843995|Updates to Wikipedia wordmark/taglines and project icons (T319223)]] [20:08:22] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:08:35] !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:843995|Updates to Wikipedia wordmark/taglines and project icons (T319223)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P35735 and previous config saved to /var/cache/conftool/dbconfig/20221020-200838-ladsgroup.json [20:09:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [20:09:31] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321312)', diff saved to https://phabricator.wikimedia.org/P35736 and previous config saved to /var/cache/conftool/dbconfig/20221020-200947-ladsgroup.json [20:10:05] Jdlrobson: your change is live on mwdebug, check please [20:10:48] thcipriani: checking [20:10:51] PROBLEM - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:11:23] (03PS1) 10JHathaway: aux_k8s_etcd: initial etcd config [puppet] - 10https://gerrit.wikimedia.org/r/845047 [20:11:27] (03CR) 10CI reject: [V: 04-1] Only generate QS maxlag for pooled servers [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [20:11:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/845047 (owner: 10JHathaway) [20:12:08] (03CR) 10Hoo man: "recheck" [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [20:12:30] thcipriani: looking good but I'd like a little more time if that's okay? [20:12:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:13:06] Jdlrobson: seems fine, how much time do you need? [20:13:55] maybe another 5 mins? [20:14:19] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4047.mgmt.ulsfo.wmnet with reboot policy FORCED [20:14:19] ok, sure ^ FYI TheresNoTime [20:14:22] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4049.mgmt.ulsfo.wmnet with reboot policy FORCED [20:14:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:15] thcipriani: ack, am I okay to set 845015 merging or would you rather I waited? 
[20:15:49] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [20:16:11] thcipriani: feel free to sync [20:16:13] looking good [20:17:09] cool, going live [20:17:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P35737 and previous config saved to /var/cache/conftool/dbconfig/20221020-201748-ladsgroup.json [20:17:54] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [20:18:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:20] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:20:01] (03CR) 10JHathaway: [C: 03+2] aux_k8s_etcd: initial etcd config [puppet] - 10https://gerrit.wikimedia.org/r/845047 (owner: 10JHathaway) [20:20:58] !log dzahn@cumin2002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:eqiad and (A:gitlab-runner) [20:21:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:21:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:17] MatmaRex: your patch will be next FYI [20:22:20] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [20:22:21] (03CR) 10Jdlrobson: "Stang do you have any idea why no_wordmark is not working? I think there's an issue in the Python script." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 (owner: 10Jdlrobson) [20:22:22] thanks [20:22:27] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:843995|Updates to Wikipedia wordmark/taglines and project icons (T319223)]] (duration: 14m 12s) [20:22:32] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:22:42] ^ Jdlrobson should be live now [20:22:58] TheresNoTime: thank you for letting me interrupt, all yours again :) [20:23:04] thcipriani: no worries :) [20:23:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845015 (https://phabricator.wikimedia.org/T321185) (owner: 10Bartosz Dziewoński) [20:23:23] thcipriani: thank you! 
[20:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P35739 and previous config saved to /var/cache/conftool/dbconfig/20221020-202344-ladsgroup.json [20:24:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P35740 and previous config saved to /var/cache/conftool/dbconfig/20221020-202454-ladsgroup.json [20:25:41] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS buster [20:25:51] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:32] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:29:40] (03Merged) 10jenkins-bot: ReplyLinksController: Skip empty reply buttons container [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845015 (https://phabricator.wikimedia.org/T321185) (owner: 10Bartosz Dziewoński) [20:29:53] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) When attempting to run the provsion script for both ganeti4005 and lvs4008 I get the same error: ` Testing Redfish API connection to lvs4008 (10... 
[20:29:58] !log samtar@deploy1002 Started scap: Backport for [[gerrit:845015|ReplyLinksController: Skip empty reply buttons container (T321185)]] [20:30:18] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:845015|ReplyLinksController: Skip empty reply buttons container (T321185)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:30:21] MatmaRex: live on mwdebug, can you test please? :) [20:30:40] yeah [20:30:46] RECOVERY - SSH on db1121.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:04] (03PS1) 10JHathaway: aux_k8s_etcd: assign role to servers [puppet] - 10https://gerrit.wikimedia.org/r/845050 [20:31:08] and looks good [20:31:28] syncin' [20:32:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/845050 (owner: 10JHathaway) [20:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P35741 and previous config saved to /var/cache/conftool/dbconfig/20221020-203255-ladsgroup.json [20:33:45] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [20:34:17] hoo: it'll be your patch next FYI [20:34:38] Thanks... can I self-service? 
[20:35:12] hoo: sure, I'll let you know when this has finished syncing, and I've got one to self-serve myself after you :) [20:35:22] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:845015|ReplyLinksController: Skip empty reply buttons container (T321185)]] (duration: 05m 23s) [20:35:24] Sounds good to me [20:35:27] T321185: Error: Widget not found - DiscussionTools try to kicks in when previewing the edit - https://phabricator.wikimedia.org/T321185 [20:35:30] I'll hit +2 already, tests take ages here [20:35:45] (03CR) 10Hoo man: [C: 03+2] Only generate QS maxlag for pooled servers [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [20:36:07] MatmaRex: that's live now :) [20:36:20] thanks [20:36:39] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [20:36:43] hoo: all yours, give me a ping when you're done? :) [20:36:51] Will do, thanks :) [20:36:54] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host graphite2004.codfw.wmnet [20:37:14] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [20:38:27] (03CR) 10JHathaway: [C: 03+2] aux_k8s_etcd: assign role to servers [puppet] - 10https://gerrit.wikimedia.org/r/845050 (owner: 10JHathaway) [20:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35742 and previous config saved to /var/cache/conftool/dbconfig/20221020-203851-ladsgroup.json [20:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P35743 and previous config saved to /var/cache/conftool/dbconfig/20221020-204000-ladsgroup.json [20:40:53] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling 
reboot on A:eqiad and (A:gitlab-runner) [20:44:30] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet [20:45:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35744 and previous config saved to /var/cache/conftool/dbconfig/20221020-204506-ladsgroup.json [20:45:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [20:46:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS buster [20:47:38] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Hi @Volans - we have one thought (and we're totally open to feedback, pros, cons, etc) that's somewhat tied to T310594. What if we were to change the status of... [20:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P35745 and previous config saved to /var/cache/conftool/dbconfig/20221020-204801-ladsgroup.json [20:48:49] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:48:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [20:49:07] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [20:51:54] (03Merged) 10jenkins-bot: Only generate QS maxlag for pooled servers [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [20:52:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
hoo@deploy1002 using scap backport" [extensions/Wikidata.org] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845016 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [20:52:25] !log hoo@deploy1002 Started scap: Backport for [[gerrit:845016|Only generate QS maxlag for pooled servers (T315423 T238751)]] [20:52:28] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:52:28] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:52:44] !log hoo@deploy1002 hoo and hoo: Backport for [[gerrit:845016|Only generate QS maxlag for pooled servers (T315423 T238751)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:52:58] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [20:53:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [20:53:27] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [20:53:32] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10SDunlap) [20:54:28] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321312)', diff saved to https://phabricator.wikimedia.org/P35746 and previous config saved to /var/cache/conftool/dbconfig/20221020-205507-ladsgroup.json [20:55:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [20:55:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [20:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T321312)', diff saved to https://phabricator.wikimedia.org/P35747 and previous config saved to /var/cache/conftool/dbconfig/20221020-205532-ladsgroup.json [20:55:42] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10SDunlap) [20:56:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:57:53] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10Dzahn) Hello, yes, confirmed your username "kindrobot" already is in LDAP and in the 'wmf' group. It does not have shell access yet though. So the answer is yes and no. This is... [20:57:58] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:58:10] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:58:15] (03PS1) 10JHathaway: aux_k8s_etcd: fix typo in discovery name [puppet] - 10https://gerrit.wikimedia.org/r/845051 (https://phabricator.wikimedia.org/T321134) [20:58:35] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10RhinosF1) Note: Attended deployment training in T320725 [20:59:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/845051 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [20:59:37] !log hoo@deploy1002 Finished scap: Backport for [[gerrit:845016|Only generate QS maxlag for pooled servers (T315423 T238751)]] (duration: 07m 12s) [20:59:43] T238751: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 [20:59:44] T315423: Revive and merge patch to update maxlag calculation - https://phabricator.wikimedia.org/T315423 [21:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P35748 and previous config saved to /var/cache/conftool/dbconfig/20221020-210012-ladsgroup.json [21:00:24] TheresNoTime: I'm all done [21:00:30] hoo: thanks :) [21:00:40] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet [21:00:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845014 (https://phabricator.wikimedia.org/T320543) (owner: 10Samtar) [21:01:04] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [21:01:18] !log extending UTC late backport window [21:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:36] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10SDunlap) cc @Jrbranaa (manager) Could I please get your approval? 
[21:01:56] (03CR) 10JHathaway: [C: 03+2] aux_k8s_etcd: fix typo in discovery name [puppet] - 10https://gerrit.wikimedia.org/r/845051 (https://phabricator.wikimedia.org/T321134) (owner: 10JHathaway) [21:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321312)', diff saved to https://phabricator.wikimedia.org/P35749 and previous config saved to /var/cache/conftool/dbconfig/20221020-210205-ladsgroup.json [21:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P35750 and previous config saved to /var/cache/conftool/dbconfig/20221020-210308-ladsgroup.json [21:03:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [21:03:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:04:17] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10Dzahn) @SDunlap Please also make yourself familiar with L3 and sign it here on Phabricator. Thank you [21:04:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:04:36] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin2002 - T321310 [21:05:09] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 5 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10hoo) 05Stalled→03Resolved a:03hoo [21:05:25] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:15] PROBLEM - Etcd cluster health on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Etcd [21:06:19] (03PS1) 10Ssingh: hiera: update cp4037.yaml [puppet] - 10https://gerrit.wikimedia.org/r/845053 [21:06:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [21:06:29] (03PS1) 10Ryan Kemper: elastic: don't block on /root/allow_es7 existing [puppet] - 10https://gerrit.wikimedia.org/r/845054 (https://phabricator.wikimedia.org/T308676) [21:07:24] (03CR) 10Ssingh: [C: 03+2] hiera: update cp4037.yaml [puppet] - 10https://gerrit.wikimedia.org/r/845053 (owner: 10Ssingh) [21:07:27] (03Merged) 10jenkins-bot: statsv: Add error counters to delete/tags .js [extensions/PageTriage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845014 (https://phabricator.wikimedia.org/T320543) (owner: 10Samtar) [21:07:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [21:07:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:845014|statsv: Add error counters to delete/tags .js (T320543)]] [21:07:49] T320543: Track error counts using statsv - https://phabricator.wikimedia.org/T320543 [21:08:00] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet [21:08:04] !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:845014|statsv: Add error counters to 
delete/tags .js (T320543)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:08:18] (03PS1) 10Ryan Kemper: elastic: no more /root/allow_es7 [cookbooks] - 10https://gerrit.wikimedia.org/r/845055 (https://phabricator.wikimedia.org/T308676) [21:08:22] (03CR) 10Bking: [C: 03+1] elastic: don't block on /root/allow_es7 existing [puppet] - 10https://gerrit.wikimedia.org/r/845054 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:08:32] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10SDunlap) Read and signed. [21:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:03] * TheresNoTime testing [21:09:27] (03CR) 10Ryan Kemper: [C: 03+2] elastic: don't block on /root/allow_es7 existing [puppet] - 10https://gerrit.wikimedia.org/r/845054 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:09:31] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 153, active_shards: 306, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [21:09:32] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:10:48] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [21:11:02] (03CR) 10Bking: [C: 03+1] elastic: no more /root/allow_es7 [cookbooks] 
- 10https://gerrit.wikimedia.org/r/845055 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:11:08] (03PS2) 10Andrew Bogott: openstack: make domain-aware [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) [21:11:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:11:59] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [21:11:59] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:23] (03PS2) 10Bking: elastic: no more /root/allow_es7 [cookbooks] - 10https://gerrit.wikimedia.org/r/845055 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:12:32] * TheresNoTime syncs [21:12:54] (03CR) 10Andrew Bogott: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [21:13:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [21:13:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:05] PROBLEM - Check for large files in client bucket on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 
10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [21:14:05] PROBLEM - configured eth on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:15:09] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [21:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P35751 and previous config saved to /var/cache/conftool/dbconfig/20221020-211519-ladsgroup.json [21:15:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:16:11] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:845014|statsv: Add error counters to delete/tags .js (T320543)]] (duration: 08m 26s) [21:16:16] T320543: Track error counts using statsv - https://phabricator.wikimedia.org/T320543 [21:16:25] !log close UTC late window [21:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:16:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [21:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35752 and previous config saved to /var/cache/conftool/dbconfig/20221020-211712-ladsgroup.json [21:17:19] (03PS1) 10Daniel Kinzler: Set VisualEditorDefaultParsoidClient for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531) [21:17:24] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [21:17:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [21:18:30] PROBLEM - Check systemd state on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:30] PROBLEM - etcd service on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:19:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:20:02] PROBLEM - Check the NTP synchronisation status of timesyncd on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [21:20:02] PROBLEM - puppet last run on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:21:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:21:22] PROBLEM - Check unit status of etcd-backup on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:21:39] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle - https://phabricator.wikimedia.org/T321218 (10Aklapper) 05Resolved→03Open Reopening per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group [21:24:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle 
- https://phabricator.wikimedia.org/T321218 (10herron) 05Open→03Resolved >>! In T321218#8334262, @Aklapper wrote: > Reopening per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group Done [21:24:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10Papaul) out put after running the cookbook on lvs4008 ` END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with... [21:25:16] PROBLEM - Check whether ferm is active by checking the default input chain on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:25:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:25:50] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [21:25:50] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:28:06] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:29] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10thcipriani) >>! In T321355#8334140, @Dzahn wrote: > Meanwhile adding @thcipriani for approval for additions to the deployment group. Approved! 
[21:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35753 and previous config saved to /var/cache/conftool/dbconfig/20221020-213025-ladsgroup.json [21:30:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [21:30:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [21:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35754 and previous config saved to /var/cache/conftool/dbconfig/20221020-213050-ladsgroup.json [21:31:15] SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: eqiad: (3) VMs requested for aux-k8s-etcd - https://phabricator.wikimedia.org/T321134 (jhathaway) etcd is up! ` $ etcdctl -C https://$(hostname -f):2379 cluster-health member 9bcd201aa17309ef is healthy: got healthy result from http... 
[21:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35755 and previous config saved to /var/cache/conftool/dbconfig/20221020-213218-ladsgroup.json [21:32:56] PROBLEM - Check size of conntrack table on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:33:10] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Microcode [21:33:10] PROBLEM - dhclient process on aux-k8s-etcd1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.21: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:34:22] RECOVERY - Check size of conntrack table on aux-k8s-etcd1001 is OK: OK: nf_conntrack is 2 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:34:22] RECOVERY - Check for large files in client bucket on aux-k8s-etcd1001 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [21:34:22] RECOVERY - Check systemd state on aux-k8s-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:22] RECOVERY - etcd service on aux-k8s-etcd1001 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:22] RECOVERY - Etcd cluster health on aux-k8s-etcd1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [21:35:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:37:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35756 and previous config saved to /var/cache/conftool/dbconfig/20221020-213709-ladsgroup.json [21:38:26] RECOVERY - puppet last run on aux-k8s-etcd1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:39:28] RECOVERY - Check the NTP synchronisation status of timesyncd on aux-k8s-etcd1001 is OK: OK: synced at Thu 2022-10-20 21:39:26 UTC. https://wikitech.wikimedia.org/wiki/NTP [21:40:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [21:41:38] RECOVERY - dhclient process on aux-k8s-etcd1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:41:39] RECOVERY - Check unit status of etcd-backup on aux-k8s-etcd1001 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:41:39] RECOVERY - configured eth on aux-k8s-etcd1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:41:39] RECOVERY - Check whether ferm is active by checking the default input chain on aux-k8s-etcd1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:41:40] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on aux-k8s-etcd1001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [21:42:09] (03PS6) 10Jdlrobson: WIP: Logo tooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839680 [21:42:21] (03PS3) 10Jdlrobson: DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) [21:42:21] (03PS1) 10Jdlrobson: Promote several Wikipedias to desktop 
improvements group [mediawiki-config] - https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319223) [21:42:23] (PS2) Jdlrobson: Promote several Wikipedias to desktop improvements group [mediawiki-config] - https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319223) [21:42:51] (CR) CI reject: [V: -1] WIP: Logo tooling [mediawiki-config] - https://gerrit.wikimedia.org/r/839680 (owner: Jdlrobson) [21:43:05] (CR) CI reject: [V: -1] Promote several Wikipedias to desktop improvements group [mediawiki-config] - https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [21:43:07] (CR) CI reject: [V: -1] DONOTMERGE: Proof of concept for batch updating DI wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/839700 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [21:43:51] SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: eqiad: (3) VMs requested for aux-k8s-etcd - https://phabricator.wikimedia.org/T321134 (jhathaway) Open→Resolved [21:47:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321312)', diff saved to https://phabricator.wikimedia.org/P35757 and previous config saved to /var/cache/conftool/dbconfig/20221020-214725-ladsgroup.json [21:47:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [21:47:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [21:47:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T321312)', diff saved to https://phabricator.wikimedia.org/P35758 and previous config saved to /var/cache/conftool/dbconfig/20221020-214750-ladsgroup.json [21:47:56] (PS1) Ssingh: cp4029: decommission host [puppet] - https://gerrit.wikimedia.org/r/845061 
[21:52:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P35759 and previous config saved to /var/cache/conftool/dbconfig/20221020-215216-ladsgroup.json [21:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321312)', diff saved to https://phabricator.wikimedia.org/P35760 and previous config saved to /var/cache/conftool/dbconfig/20221020-215418-ladsgroup.json [21:55:38] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [21:57:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS buster [21:58:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [22:03:02] PROBLEM - Check systemd state on elastic1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P35761 and previous config saved to /var/cache/conftool/dbconfig/20221020-220722-ladsgroup.json [22:09:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P35762 and previous config saved to /var/cache/conftool/dbconfig/20221020-220924-ladsgroup.json [22:09:43] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (Papaul) So the issue with ganeti4005 was that the bios boot setting was 
set to UEFI that is the reason RedFish was failing so after I changed it to BIOS... [22:10:46] PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:44] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:48] PROBLEM - Check systemd state on elastic1091 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:22:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35763 and previous config saved to /var/cache/conftool/dbconfig/20221020-222229-ladsgroup.json [22:22:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [22:22:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [22:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T321312)', diff saved to https://phabricator.wikimedia.org/P35764 and previous config saved to 
/var/cache/conftool/dbconfig/20221020-222253-ladsgroup.json [22:23:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:24:30] PROBLEM - Check systemd state on elastic1090 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P35765 and previous config saved to /var/cache/conftool/dbconfig/20221020-222431-ladsgroup.json [22:25:54] PROBLEM - Check systemd state on elastic1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:44] RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:44] RECOVERY - Check systemd state on elastic1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:18] PROBLEM - Check systemd state on ml-serve1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321312)', diff saved to https://phabricator.wikimedia.org/P35766 and previous config saved to /var/cache/conftool/dbconfig/20221020-222903-ladsgroup.json [22:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance 
elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:33:14] PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:52] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:36] RECOVERY - Check systemd state on elastic1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:08] RECOVERY - Check systemd state on elastic1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:18] PROBLEM - Check systemd state on elastic1072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:08] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:36:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet [22:37:21] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for 
SDunlap - https://phabricator.wikimedia.org/T321355 (10Jrbranaa) Approved [22:39:32] RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321312)', diff saved to https://phabricator.wikimedia.org/P35767 and previous config saved to /var/cache/conftool/dbconfig/20221020-223937-ladsgroup.json [22:39:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [22:39:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [22:40:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T321312)', diff saved to https://phabricator.wikimedia.org/P35768 and previous config saved to /var/cache/conftool/dbconfig/20221020-224003-ladsgroup.json [22:41:06] PROBLEM - Check systemd state on elastic1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:10] PROBLEM - Check systemd state on elastic1084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:34] PROBLEM - Check systemd state on elastic1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:10] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet [22:43:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet [22:43:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet [22:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P35769 and previous config saved to /var/cache/conftool/dbconfig/20221020-224409-ladsgroup.json [22:47:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321312)', diff saved to https://phabricator.wikimedia.org/P35770 and previous config saved to /var/cache/conftool/dbconfig/20221020-224736-ladsgroup.json [22:48:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet [22:49:10] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:28] PROBLEM - Check systemd state on elastic1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet [22:49:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt2001-dev.codfw.wmnet [22:49:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt2002-dev.codfw.wmnet 
[22:50:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt2003-dev.codfw.wmnet [22:51:04] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:40] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:38] RECOVERY - Check systemd state on elastic1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:10] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:10] PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P35771 and previous config saved to /var/cache/conftool/dbconfig/20221020-225916-ladsgroup.json [22:59:22] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:40] RECOVERY - Check systemd state on elastic1084 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:04] RECOVERY - Check systemd state on elastic1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:02] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [23:02:04] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3062 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [23:02:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221020-230242-ladsgroup.json [23:02:56] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [23:03:19] (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:19] (ProbeDown) firing: (17) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:38] was down for a few moments there, ~back now [23:04:12] RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:34] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1059 is CRITICAL: ERROR ferm input drop default policy not set, 
ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:04:38] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_443: Serve [23:04:38] 1.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:04:46] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:46] PROBLEM - Check systemd state on elastic1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:14] RECOVERY - Check systemd state on elastic1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:14] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:06:16] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: 
/{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [23:06:16] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [23:06:26] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [23:06:50] PROBLEM - Check systemd state on elastic1078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:14] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:07:34] RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:16] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:08:16] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [23:08:18] (ProbeDown) resolved: (21) Service commons.wikimedia.org:443 has failed probes 
(http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:19] (ProbeDown) resolved: (17) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:32] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:08:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:09:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [23:12:00] RECOVERY - Check systemd state on elastic1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:50] PROBLEM - Check systemd state on elastic1086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:04] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:46] PROBLEM - Check systemd state on elastic1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:01] 
(NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [23:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321312)', diff saved to https://phabricator.wikimedia.org/P35772 and previous config saved to /var/cache/conftool/dbconfig/20221020-231422-ladsgroup.json [23:14:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [23:14:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [23:14:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35773 and previous config saved to /var/cache/conftool/dbconfig/20221020-231446-ladsgroup.json [23:17:42] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P35774 and previous config saved to /var/cache/conftool/dbconfig/20221020-231754-ladsgroup.json [23:17:55] RECOVERY - Check systemd state on elastic1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:00] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:17] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35775 and previous config saved to /var/cache/conftool/dbconfig/20221020-232116-ladsgroup.json [23:21:18] RECOVERY - Check systemd state on elastic1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:33] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [23:23:56] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:26] RECOVERY - Check systemd state on elastic1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:42] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1056 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:25:31] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [23:25:53] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [23:26:14] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:34] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:29:11] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [23:29:58] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class 
hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [23:30:20] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:22] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:08] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:15] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [23:31:30] RECOVERY - Check systemd state on elastic1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:34] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED [23:32:48] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4005.ulsfo.wmnet with OS bullseye [23:32:57] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye [23:33:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P35776 and previous config saved to /var/cache/conftool/dbconfig/20221020-233300-ladsgroup.json [23:33:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [23:33:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [23:33:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35777 and previous config saved to /var/cache/conftool/dbconfig/20221020-233325-ladsgroup.json [23:33:48] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35778 and previous config saved to /var/cache/conftool/dbconfig/20221020-233452-ladsgroup.json [23:35:40] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P35779 and previous config saved to /var/cache/conftool/dbconfig/20221020-233623-ladsgroup.json [23:38:03] !log sudo cumin 'A:installserver' 'run-puppet-agent -q' for Gerrit 845074 [23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:45] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:53] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti4005.ulsfo.wmnet with OS 
bullseye [23:38:56] PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:00] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye executed with errors... [23:39:00] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:05] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4005.ulsfo.wmnet with OS bullseye [23:39:14] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye [23:39:32] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35780 and previous config saved to /var/cache/conftool/dbconfig/20221020-234104-ladsgroup.json [23:41:37] !log COMPLETED: sudo cumin 'A:installserver' 'run-puppet-agent -q' for Gerrit 845074 
[23:41:38] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [23:44:02] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1066 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:44:40] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudvirt2002-dev.codfw.wmnet [23:44:41] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudvirt2003-dev.codfw.wmnet [23:44:42] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudvirt2001-dev.codfw.wmnet [23:44:54] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:04] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [23:46:46] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:48] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:48:16] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The 
following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:35] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:49:35] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1063 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:51:20] RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:23] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [23:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P35781 and previous config saved to /var/cache/conftool/dbconfig/20221020-235130-ladsgroup.json [23:51:54] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:54:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [23:55:18] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt2003-dev.codfw.wmnet [23:55:46] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudvirt2003-dev.codfw.wmnet [23:55:48] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1056 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:56:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P35782 and previous config saved to /var/cache/conftool/dbconfig/20221020-235611-ladsgroup.json [23:57:18] RECOVERY - Check systemd state on ml-serve1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:47] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4008 [23:58:56] !log robh@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs4008