[00:11:18] !log end running skin preference update script T299104
[00:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:23] T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104
[00:25:45] (PS1) Andrew Bogott: cloudvirt1024: update nic ids and set legacy_vlan_naming: false [puppet] - https://gerrit.wikimedia.org/r/772947
[00:27:06] (CR) Andrew Bogott: [C: +2] cloudvirt1024: update nic ids and set legacy_vlan_naming: false [puppet] - https://gerrit.wikimedia.org/r/772947 (owner: Andrew Bogott)
[00:39:47] RECOVERY - ensure kvm processes are running on cloudvirt1024 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:32:59] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:04] Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0200)
[02:07:26] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965
[02:07:28] (CR) TrainBranchBot:
[C: +2] Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965 (owner: TrainBranchBot)
[02:07:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:08:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:30] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965 (owner: TrainBranchBot)
[02:29:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:29:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:35] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:47] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:29] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[03:05:37] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:46:15] (CR) Cwhite: [C: +1] role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[04:41:07] SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (odimitrijevic)
[05:02:29] SRE, GitLab, Horizon, wikitech.wikimedia.org, Security: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (hashar)
[05:08:11] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:11] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:47:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:53:34] SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) Thank you Chris, the RAID is back to optimal
[05:57:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively -
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:02:05] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 with low weight T301879', diff saved to https://phabricator.wikimedia.org/P22995 and previous config saved to /var/cache/conftool/dbconfig/20220323-060351-marostegui.json
[06:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:58] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:05:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for reimage', diff saved to https://phabricator.wikimedia.org/P22996 and previous config saved to /var/cache/conftool/dbconfig/20220323-060533-marostegui.json
[06:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:18] (PS1) Marostegui: db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773120 (https://phabricator.wikimedia.org/T300600)
[06:07:31] (CR) Marostegui: [C: +2] db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773120 (https://phabricator.wikimedia.org/T300600) (owner: Marostegui)
[06:09:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1112.eqiad.wmnet with OS bullseye
[06:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1112.eqiad.wmnet with reason: host reimage
[06:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:42] (CR) Marostegui: [C: +1] "This looks good to me" [puppet] -
https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:20:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1112.eqiad.wmnet with reason: host reimage
[06:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:15] (CR) Marostegui: [C: +1] wmcs: stop accessing gu_hidden in maintain-views [puppet] - https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) (owner: Zabe)
[06:24:40] (CR) Marostegui: [C: +2] wmcs: stop accessing gu_hidden in maintain-views [puppet] - https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) (owner: Zabe)
[06:24:51] (CR) Marostegui: [C: +2] wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:25:00] (PS2) Marostegui: wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:26:27] (CR) Marostegui: [V: +2 C: +2] wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:34:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1112.eqiad.wmnet with OS bullseye
[06:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:49] (PS1) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:37:10] (CR) jerkins-bot: [V: -1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134
(https://phabricator.wikimedia.org/T302658) (owner: Marostegui)
[06:37:21] (PS2) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:37:39] (CR) jerkins-bot: [V: -1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: Marostegui)
[06:38:41] (PS3) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:41:36] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:42:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:42:45] (PS1) Marostegui: Revert "db1112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772895
[06:43:28] (CR) Marostegui: [C: +2] Revert "db1112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772895 (owner: Marostegui)
[06:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P22997 and previous config saved to /var/cache/conftool/dbconfig/20220323-064353-root.json
[06:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:01] (CirrusSearchHighOldGCFrequency) resolved: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively -
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:49:40] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:14] ops-eqiad, serviceops: mc1053 PS redundancy alert - https://phabricator.wikimedia.org/T304477 (elukey)
[06:58:53] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P22998 and previous config saved to /var/cache/conftool/dbconfig/20220323-065856-root.json
[06:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:25] indeed, nothing to do
[07:01:22] (PS4) Elukey: Initial debianization of istio-cni [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612)
[07:02:47] (CR) Elukey: Initial debianization of istio-cni (3 comments) [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[07:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P22999 and previous config saved to /var/cache/conftool/dbconfig/20220323-071400-root.json
[07:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P23000 and previous config saved to /var/cache/conftool/dbconfig/20220323-072904-root.json
[07:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:26] (CR) Elukey: [C: +1] "LGTM!
We can deploy anytime" [deployment-charts] - https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270) (owner: AikoChou)
[07:44:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P23001 and previous config saved to /var/cache/conftool/dbconfig/20220323-074408-root.json
[07:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:52] (PS1) Elukey: Set bullseye + overlayfs for kubernetes1009 [puppet] - https://gerrit.wikimedia.org/r/773181 (https://phabricator.wikimedia.org/T300744)
[07:48:53] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1009 [puppet] - https://gerrit.wikimedia.org/r/773181 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:54:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1009.eqiad.wmnet with OS bullseye
[07:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] dancy, hashar, brennen, dduvall, jeena, and jnuche: Time to snap out of that daydream and deploy 🚂🧪Trainsperiment Week Deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0800).
[08:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: (Dis)respected human, time to deploy 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0800). Please do the needful.
[08:00:13] (PS14) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:03:58] (KubernetesCalicoDown) firing: kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:04:04] SRE, SRE Observability: thanos: 404 error trying to fetch js library - https://phabricator.wikimedia.org/T269000 (fgiunchedi) Open→Declined Declining because this is indeed harmless and we're not looking at having sourcemaps for thanos
[08:05:01] (PS15) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:06:37] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34503/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:09:11] SRE, observability, SRE Observability (FY2021/2022-Q3), Sustainability (Incident Followup), User-fgiunchedi: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (fgiunchedi)
[08:09:23] SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), and 2 others: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi)
[08:10:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage
[08:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:20] (PS16) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:12:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage
[08:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:03] (PS17) Elukey:
WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:16:42] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34505/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:19:24] (PS18) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:20:28] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34506/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:21:40] (Abandoned) Razzi: kafka-main: add kafka-main200[45] to the codfw cluster [puppet] - https://gerrit.wikimedia.org/r/520465 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[08:22:08] (PS19) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:23:17] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34507/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:23:58] (KubernetesCalicoDown) resolved: kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:24:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1009.eqiad.wmnet with OS bullseye
[08:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:06] (PS8) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[08:27:38] (CR) jerkins-bot: [V: -1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] -
https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:27:53] (PS20) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:28:55] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34508/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:29:58] (PS2) MMandere: site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005)
[08:31:06] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:36:33] !log depool cp1079 for reimage - T290005
[08:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:38] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[08:40:18] (PS1) Hashar: mediawiki::php::monitoring: dupe def PHP_VERSION [puppet] - https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945)
[08:40:39] (PS21) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612)
[08:40:41] (PS1) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:41:10] (CR) Hashar: "That should remove 70k/h php notices from logstash ;)" [puppet] - https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945) (owner: Hashar)
[08:41:35] (CR) jerkins-bot: [V: -1] WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:42:37] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34509/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:43:08] (PS9) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[08:43:48] !log installing openssl security updates
[08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:20] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:02] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:28] (PS2) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:46:31] (CR) jerkins-bot: [V: -1] WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:46:34] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:47:44] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS buster
[08:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:52] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[08:48:06] (PS3) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:49:09] (CR) Elukey:
[V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34512/console" [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:50:44] (CR) Elukey: [V: +1] "This is an example of how the istio-cni config could be easily chained:" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:51:29] Puppet, Infrastructure-Foundations: Search broken on puppetboard - https://phabricator.wikimedia.org/T304484 (ayounsi) p:Triage→Low
[08:51:33] !log mmandere@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1079.eqiad.wmnet with OS buster
[08:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:41] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster exe...
[08:51:49] (CR) MMandere: [C: +2] site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[08:54:04] !log restarting spamassassin/clamav on otrs1001/ticket.wikimedia.org
[08:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS buster
[08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:49] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[08:57:11] (CR) Elukey: [C: +1] "I am not super familiar with the scaffold/etc.. configs but LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[08:59:24] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] (PS1) MMandere: site: Reimage cp1081 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005)
[09:03:34] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:04:33] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[09:06:01] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:49] (CR) Vgutierrez: [C: +1] site: Reimage cp1081 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[09:09:42] (CR) Elukey: [C: +1] Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:10:44] (CR) Elukey: "Looks good, I'd also ask a quick review to Traffic for confirmation/awareness of the change." [puppet] - https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:11:18] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:12] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:09] PROBLEM - Check systemd state on db1169 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:33] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:50] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991)
[09:21:24] (CR) jerkins-bot: [V: -1] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:23:01] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991)
[09:23:23] (PS1) JMeybohm: Update miscweb to latest scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966)
[09:24:52] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:38] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1079.eqiad.wmnet with OS buster
[09:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:47] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster com...
[09:39:50] (CR) Muehlenhoff: [C: +1] "Looks good to me!"
[puppet] - https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: Filippo Giunchedi)
[09:39:54] (PS1) Jcrespo: mediabackups: Test mediabackups updates on testwiki only [puppet] - https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764)
[09:40:16] (PS2) Jcrespo: mediabackups: Test mediabackups updates on testwiki only [puppet] - https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764)
[09:41:00] ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (ayounsi) p:Triage→High
[09:42:45] ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (ayounsi)
[09:43:06] !log pool cp1079 with HAProxy as TLS termination layer - T290005
[09:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:12] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:46:10] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw2-b-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T304488 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[09:46:20] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php::monitoring: dupe def PHP_VERSION [puppet] - https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945) (owner: Hashar)
[09:47:11] _joe_: to be fair I have no idea why we had `define( 'PHP_VERSION', php_version() );` maybe it had a specific purpose :\
[09:47:48] <_joe_> hashar: it didn't, it was part of a huge patch series to introduce multiple php engines at the same time, it slipped
[09:47:50] once puppet ran on the host we should see a drop at https://logstash.wikimedia.org/goto/5967d326a61573afd237736c95d08a01
[09:47:52] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nginx/htmldumps [puppet] -
https://gerrit.wikimedia.org/r/772335 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:47:57] ah good
[09:48:02] <_joe_> I was writing 5 languages in the same patches, that kind of stuff
[09:48:14] I noticed that when opening logstash which shows the unfiltered event at that one standed out this morning :]
[09:48:19] ahah
[09:48:25] too many languages
[09:48:49] <_joe_> yeah you have puppet, ruby for the templates, bash, php, and some go-langish dsl for mtail, and ofc python
[09:50:55] (CR) JMeybohm: [C: +2] Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:51:04] (CR) JMeybohm: [C: +2] Update miscweb to latest scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:54:49] (Merged) jenkins-bot: Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:56:27] (PS2) JMeybohm: Update miscweb to latest scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966)
[09:56:35] !log depool cp1081 for reimage - T290005
[09:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:43] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:01:08] (CR) MMandere: [C: +2] site: Reimage cp1081 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:04:09] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:07:55] !log mmandere@cumin1001 START -
Cookbook sre.hosts.reimage for host cp1081.eqiad.wmnet with OS buster [10:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:05] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster [10:08:42] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1010 [puppet] - 10https://gerrit.wikimedia.org/r/773193 (https://phabricator.wikimedia.org/T300744) [10:18:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 some more weight T301879', diff saved to https://phabricator.wikimedia.org/P23002 and previous config saved to /var/cache/conftool/dbconfig/20220323-101816-marostegui.json [10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:22] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [10:21:56] (03PS31) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:22:29] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Test mediabackups updates on testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764) (owner: 10Jcrespo) [10:22:59] (03CR) 10Btullis: [V: 03+1 C: 03+2] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [10:23:54] btullis: merging? [10:24:07] Yes, was about to ask you. [10:24:25] mine is ok, if it is a one line change saying wiki:testwiki [10:24:25] Happy for me to merge a89e890325 for you? [10:24:40] Done, thanks. 
[10:24:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage [10:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:30] (03CR) 10Ayounsi: "This is not WMF specific so in theory should go in the main homer branch. But realistically it doesn't matter too much :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [10:28:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage [10:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:09] !log restarting ntpd [10:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:02] (03PS1) 10ArielGlenn: include the dumps admins in the dumpsdata role [puppet] - 10https://gerrit.wikimedia.org/r/773195 [10:36:51] (03Restored) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo) [10:37:07] (03PS2) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 [10:37:39] (03CR) 10jerkins-bot: [V: 04-1] test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo) [10:37:49] (03PS1) 10MMandere: site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005) [10:37:51] (03PS1) 10MMandere: site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) [10:37:53] RECOVERY - Check systemd state on db1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:53] (03PS1) 10MMandere: site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 
(https://phabricator.wikimedia.org/T290005) [10:37:55] (03PS1) 10MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) [10:38:20] (03PS1) 10Jbond: P:mediawiki: add autorestart to httpd and php [puppet] - 10https://gerrit.wikimedia.org/r/773200 [10:39:32] (03PS3) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 [10:40:16] (03PS32) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:40:50] (03Abandoned) 10Jbond: P:mediawiki: add autorestart to httpd and php [puppet] - 10https://gerrit.wikimedia.org/r/773200 (owner: 10Jbond) [10:46:20] (03PS1) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 [10:51:48] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:52:02] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1081.eqiad.wmnet with OS buster [10:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster com... 
[10:52:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) [10:52:38] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:53:23] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:53:35] (03PS4) 10Jcrespo: Add unit testing directory so that CI succeeds [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 [10:53:59] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:55:59] (03CR) 10Jcrespo: [C: 03+1] "This is ready to go, but see my comment on ticket to see if you want to add more directories now." 
[puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [10:56:38] (03CR) 10Jcrespo: [C: 03+2] Add unit testing directory so that CI succeeds [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo) [10:57:39] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:15] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) 05Open→03Stalled [10:58:29] !log restarting apache on matomo1002/piwik.wikimedia.org [10:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:38] (03CR) 10Marostegui: "I have cleaned up my cumin2002 directory" [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:00:07] !log pool cp1081 with HAProxy as TLS termination layer - T290005 [11:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:13] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:01:15] (03PS2) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 [11:01:16] (03CR) 10Jcrespo: [C: 03+1] cluster::management: backup also /home (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:02:35] (03PS3) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 [11:02:37] (03CR) 10Volans: cluster::management: backup also /home (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:03:10] (03CR) 10Jcrespo: [C: 03+1] cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:04:46] (03CR) 10Muehlenhoff: [C: 03+1] "The 6.1G were from the reimage and are now cleaned out, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773202 
(owner: 10Volans) [11:05:52] (03PS4) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 [11:07:36] 10Puppet, 10Infrastructure-Foundations: Search broken on puppetboard - https://phabricator.wikimedia.org/T304484 (10Volans) Yes that seems a typo upstream, IIRC I reported that to John a while ago, not sure if it was fixed upstream by now. [11:15:16] (03PS4) 10Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) [11:15:57] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:17:12] (03PS1) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [11:19:41] (03PS3) 10Muehlenhoff: Enable profile::auto_restarts::service for klaxon gunicorn webapp [puppet] - 10https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) [11:24:31] (03CR) 10Klausman: [C: 03+1] Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [11:25:05] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for imagecatalog [puppet] - 10https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991) [11:33:35] !log installing apache security updates on stretch [11:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:16] !log upload new puppetboard_3.1.0-1+deb11u1_all.deb [11:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:22] (03CR) 10Volans: [C: 03+2] cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans) [11:44:08] 10Puppet, 10Infrastructure-Foundations: Search broken on puppetboard - 
https://phabricator.wikimedia.org/T304484 (10jbond) 05Open→03Resolved a:03jbond I have deployed an update which has fixed this, please reopen if i missed something [11:46:14] 10SRE, 10Thumbor, 10serviceops, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10jbond) p:05Triage→03Medium [11:46:41] 10SRE, 10ops-eqiad, 10serviceops: mc1053 PS redundancy alert - https://phabricator.wikimedia.org/T304477 (10jbond) p:05Triage→03Medium [11:47:41] 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10TomekSikora.Monsoon) [11:49:35] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) [11:57:34] (03PS1) 10Jbond: admin: add sgimeno user [puppet] - 10https://gerrit.wikimedia.org/r/773207 (https://phabricator.wikimedia.org/T304361) [11:58:50] (03CR) 10Jbond: [C: 03+2] admin: add sgimeno user [puppet] - 10https://gerrit.wikimedia.org/r/773207 (https://phabricator.wikimedia.org/T304361) (owner: 10Jbond) [11:59:57] 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10Aklapper) 05Open→03Stalled @TomekSikora.Monsoon: Hi. If this is a serious request and not a test, then please edit the task title (which RESOURCE?), and fill in ALL fields in the description. [12:04:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) 05Open→03Resolved a:03jbond @Sgs access has now been set up you shuld have recived an email indicating how to configure kerberos, please re-open if you are s... 
[12:07:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 after testing', diff saved to https://phabricator.wikimedia.org/P23003 and previous config saved to /var/cache/conftool/dbconfig/20220323-120749-marostegui.json [12:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] (03CR) 10Ladsgroup: "I understand you have a large backlog but this is three weeks now." [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup) [12:16:40] (03CR) 10Jakob: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob) [12:27:55] (03CR) 10Ladsgroup: [C: 04-1] drop_gu_hidden_T302658.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui) [12:29:16] !log restarting Turnilo for OpenSSL update [12:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:31] (03PS1) 10Sbisson: Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) [12:31:47] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:06] (03PS1) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) [12:35:08] (03PS4) 10Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) [12:35:12] (03PS1) 10Jbond: C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 
(https://phabricator.wikimedia.org/T304321) [12:36:33] (03CR) 10Volans: "I didn't had a chance yet to give it a pass to the code, but I've left a comment on the packaging." [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [12:38:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34514/console" [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:40:18] (03CR) 10Ladsgroup: [C: 03+1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui) [12:40:30] (03PS2) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) [12:43:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34515/console" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:47:30] (03PS2) 10Jbond: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: 10Filippo Giunchedi) [12:47:32] (03PS1) 10Jbond: C:nagios_common: add new check for check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773215 (https://phabricator.wikimedia.org/T304321) [12:51:47] (03CR) 10Tchanders: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders) [12:52:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34518/console" [puppet] - 10https://gerrit.wikimedia.org/r/773215 
(https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:52:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:nagios_common: add new check for check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773215 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:53:05] (03PS3) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) [12:53:53] (03PS2) 10Jbond: C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) [12:55:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34519/console" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P23004 and previous config saved to /var/cache/conftool/dbconfig/20220323-125625-marostegui.json [12:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:30] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:58:22] !log installing bind security updates [12:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:44] uh [13:00:56] did I add my change to the wrong window? [13:01:05] damn, I added it tomorrow [13:01:52] should be better now [13:02:11] but anyways – I’m still eating lunch, so if there are no other changes in the window, I’ll be back in half an hour or so :) [13:02:43] (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) [13:02:45] (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) [13:05:59] (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) [13:07:48] !log depool cp1082 for reimage - T290005 [13:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:09:08] (03PS1) 10Jbond: C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321) [13:11:11] (03PS3) 10Filippo Giunchedi: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) [13:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23005 and previous config saved to /var/cache/conftool/dbconfig/20220323-131130-marostegui.json [13:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:37] (03CR) 10MMandere: [C: 03+2] site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [13:13:42] (03PS1) 10Jbond: C:lvs::monitor_services: Add 
ssl expiry checks for lvs [puppet] - 10https://gerrit.wikimedia.org/r/773221 (https://phabricator.wikimedia.org/T304321) [13:14:00] (03CR) 10Filippo Giunchedi: [C: 03+2] nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: 10Filippo Giunchedi) [13:14:34] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1010 [puppet] - 10https://gerrit.wikimedia.org/r/773193 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:16:33] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1082.eqiad.wmnet with OS buster [13:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:42] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster [13:17:04] (03CR) 10Marostegui: [C: 03+2] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui) [13:17:30] (03Merged) 10jenkins-bot: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui) [13:18:09] (03PS1) 10Jbond: C:noc: Add ssl expiry checks for noc [puppet] - 10https://gerrit.wikimedia.org/r/773223 (https://phabricator.wikimedia.org/T304321) [13:19:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1010.eqiad.wmnet with OS bullseye [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:14] PROBLEM - puppetboard-samltest.wikimedia.org requires authentication on puppetboard2002 is CRITICAL: HTTP CRITICAL: Status line output 
matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-samltest.wikimedia.org:443/ - 582 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:20:27] PROBLEM - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [13:20:35] godog, jbond ^^^ [13:20:46] looking [13:20:47] thanks volans, indeed [13:20:48] PROBLEM - puppetboard-idptest.wikimedia.org requires authentication on puppetboard1002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-idptest.wikimedia.org:443/ - 580 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:20:49] acking [13:21:20] so nothing is broken per-se, or at least not more broken than five minutes ago [13:21:36] thanks godog, ill fix this, looks like the url check never worked [13:21:48] well is using the wrong url [13:22:18] jbond: yeah, the pki alert or puppetboard-saml or both ? 
[13:22:52] PROBLEM - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [13:23:05] ditto ^ [13:23:09] * volans acked on VO [13:23:11] * Emperor here [13:23:25] ...just a bit too late, as ever :-/ [13:23:26] * jhathaway here as well [13:23:40] nothing to see here sorry for the noise [13:24:02] np [13:24:12] indeed sorry for the mispages, all for the better though at least [13:24:18] may I suggest to add some downtime to the modified checks so to spot the failing ones on icinga without having to page? [13:25:00] PROBLEM - Check to ensure the cfssl signer is working CA: debmonitor #page on pki1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [13:25:02] (03PS1) 10Jbond: C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - 10https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) [13:25:18] volans: yes will do [13:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23008 and previous config saved to /var/cache/conftool/dbconfig/20220323-132635-marostegui.json [13:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:39] Funny I didn't get paged with vo app just the text. But I got push notification when it got resolved 🤦🤦🤦 [13:27:12] {{done}} downtimed the cfssl p a g e alerts [13:28:25] alright, I’m back [13:28:41] can I proceed with the backport+config window or is something going on? 
[13:28:58] (KubernetesCalicoDown) firing: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:29:36] Lucas_WMDE: good to go I think [13:29:41] great, thanks [13:33:14] (03PS2) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 [13:33:25] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1082.eqiad.wmnet with reason: host reimage [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1027.eqiad.wmnet with OS bullseye [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye [13:35:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 (owner: 10Lucas Werkmeister (WMDE)) [13:36:28] (03Merged) 10jenkins-bot: Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 (owner: 10Lucas Werkmeister (WMDE)) [13:36:57] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1082.eqiad.wmnet with reason: host 
reimage [13:37:00] testing on mwdebug1001 [13:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:13] PROBLEM - puppetboard-idptest.wikimedia.org requires authentication on puppetboard2002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-idptest.wikimedia.org:443/ - 580 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:38:04] seems to be working fine, syncing [13:38:57] !log restarting superset for OpenSSL update [13:38:58] (KubernetesCalicoDown) resolved: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:27] kubernetes1010 is me, reimaging [13:39:31] (03PS1) 10Jbond: P:pki: fix nagios checks for PKI [puppet] - 10https://gerrit.wikimedia.org/r/773227 [13:39:44] (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob) [13:39:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage [13:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:48] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:768090|Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients]] (duration: 01m 10s) [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:26] (03PS2) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 
(https://phabricator.wikimedia.org/T303240) [13:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P23009 and previous config saved to /var/cache/conftool/dbconfig/20220323-134140-marostegui.json [13:41:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:41:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:41:45] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:41:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob) [13:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34520/console" [puppet] - 10https://gerrit.wikimedia.org/r/773227 (owner: 10Jbond) [13:41:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T300775)', diff saved to https://phabricator.wikimedia.org/P23010 and previous config saved to 
/var/cache/conftool/dbconfig/20220323-134153-marostegui.json [13:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:26] (03Merged) 10jenkins-bot: Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob) [13:42:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: fix nagios checks for PKI [puppet] - 10https://gerrit.wikimedia.org/r/773227 (owner: 10Jbond) [13:43:11] checking that the beta change does nothing on mwdebug1001… [13:43:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:04] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:31] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [13:44:42] looks good I think, I’ll sync it [13:45:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1029.eqiad.wmnet with OS bullseye [13:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:28] (KubernetesCalicoDown) firing: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:46:16] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage [13:46:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:26] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:773209|Enable Wikibase REST API on beta wikidata (T302959)]] (1/2, production no-op) (duration: 01m 07s) [13:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:30] T302959: Create a test/validation system for the Wikibase REST API - https://phabricator.wikimedia.org/T302959 [13:47:06] PROBLEM - mailman archives on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/hyperkitty/list/wikimedia-l@lists.wikimedia.org/ - 47822 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:47:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:773209|Enable Wikibase REST API on beta wikidata (T302959)]] (2/2, production no-op) (duration: 01m 05s) [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:48:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:26] !log UTC afternoon backport window done [13:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:01] PROBLEM - mailman list info on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/postorius/lists/wikimedia-l.lists.wikimedia.org/ - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] (KubernetesCalicoDown) resolved: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:50:53] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage [13:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1010.eqiad.wmnet with OS bullseye [13:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:01] (03PS1) 10Jbond: PKI: double escape, one for puppet one for icinga [puppet] - 10https://gerrit.wikimedia.org/r/773231 [13:54:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1029.eqiad.wmnet with reason: host reimage [13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] (03PS1) 10Urbanecm: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052) [13:56:04] jouncebot: nowandnext [13:56:05] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1300) [13:56:05] In 1 hour(s) and 3 minute(s): New wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1500) [13:56:21] (03PS1) 10Urbanecm: addWiki: Create GrowthExperiment's tables for 
all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052) [13:56:37] (03CR) 10Jbond: [C: 03+2] PKI: double escape, one for puppet one for icinga [puppet] - 10https://gerrit.wikimedia.org/r/773231 (owner: 10Jbond) [13:56:46] (03CR) 10Urbanecm: [C: 03+2] "deploying" [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [13:56:47] PROBLEM - puppetboard-samltest.wikimedia.org requires authentication on puppetboard1002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-samltest.wikimedia.org:443/ - 582 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:57:04] since wmf.4's not at deploy1002 yet, just +2'ed to ensure it will ride with wmf.4 [13:57:23] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [13:57:25] will do wmf.3 soon, so i can ensure the change works in the new wiki creation window in an hour [13:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:37] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Hue [puppet] - 10https://gerrit.wikimedia.org/r/773232 (https://phabricator.wikimedia.org/T135991) [13:57:50] (03PS1) 10Majavah: openstack::nova::fullstack: restart service on setting changes [puppet] - 10https://gerrit.wikimedia.org/r/773233 [13:57:52] (03PS1) 10Majavah: openstack::nova::fullstack: use bullseye image [puppet] - 10https://gerrit.wikimedia.org/r/773234 [13:58:40] (03Merged) 10jenkins-bot: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [13:58:41] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:13] (03CR) 10Filippo Giunchedi: [C: 03+1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:59:28] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:59:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1029.eqiad.wmnet with reason: host reimage [13:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:48] (03PS2) 10Urbanecm: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) [14:00:22] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [14:00:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1082.eqiad.wmnet with OS buster [14:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:36] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster com... [14:00:44] (03PS2) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) [14:02:55] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm) [14:04:19] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [14:04:19] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:32] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [14:04:32] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:51] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [14:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:02] !log pool cp1082 with HAProxy as TLS termination layer - T290005 [14:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:06] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:08:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:08:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:08:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:08] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::fullstack: restart service on setting changes [puppet] - 10https://gerrit.wikimedia.org/r/773233 (owner: 10Majavah) [14:10:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1027.eqiad.wmnet with OS bullseye [14:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:09] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye completed... 
[14:11:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [14:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::fullstack: use bullseye image [puppet] - 10https://gerrit.wikimedia.org/r/773234 (owner: 10Majavah) [14:13:27] (03PS2) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) [14:14:33] (03PS3) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) [14:14:39] (03PS1) 10Jbond: PKI: add '}' back [puppet] - 10https://gerrit.wikimedia.org/r/773237 [14:14:58] (03CR) 10Jbond: [V: 03+2 C: 03+2] PKI: add '}' back [puppet] - 10https://gerrit.wikimedia.org/r/773237 (owner: 10Jbond) [14:15:42] (03PS4) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) [14:16:37] urbanecm: ^ is it OK to +2 this, or do we need to do a sync as well? 
[14:17:21] it only touches a -labs.php file, so you need to +2 and pull to deploy1002 but you don't need to sync it [14:17:35] 10SRE, 10ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (10Jclark-ctr) 05Open→03Resolved Found Dac cable in rack B7 not seated reseated cable and confirmed link with @ayounsi [14:18:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1029.eqiad.wmnet with OS bullseye [14:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:13] !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wcqs1002.eqiad.wmnet [14:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] 10SRE, 10Infrastructure-Foundations, 10netops: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) Thinking about this further I think it works from the CRs because the peering is from the local public/private subnet to the loopbac... [14:20:03] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [14:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:09] taavi: thanks. I assume it's OK for me to +2 it since I've gotten +1s from two others, and it's -labs only [14:22:53] yeah, sounds fine to me [14:23:51] (03CR) 10Kosta Harlan: [C: 03+2] "Per Martin & Sergio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: 10Kosta Harlan) [14:23:56] !log reboot cp1085 (downtimed) [14:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:32] (03Merged) 10jenkins-bot: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: 10Kosta Harlan) [14:25:39] taavi: how do I pull to deploy1002? 
[14:26:34] scap sync-file? [14:27:07] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [14:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] cd to /srv/mediawiki-staging and then just git fetch && git rebase [14:28:32] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [14:28:33] (03PS1) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773239 [14:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:00] (03CR) 10Lucas Werkmeister (WMDE): "I think we can deploy this tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773239 (owner: 10Lucas Werkmeister (WMDE)) [14:29:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:30:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:16] kostajh: are you deploying something please? 
[14:33:31] (I'd like to, so that's why I'm asking) [14:33:34] PROBLEM - Host wcqs2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:41] urbanecm: I just +2'ed that beta labs config patch [14:33:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1030.eqiad.wmnet with OS bullseye [14:33:46] but didn't do anything else yet [14:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:05] s/beta labs/beta cluster/ please [14:34:17] :)) [14:34:18] urbanecm: I'll do the git fetch and rebase step in mediawiki-staging [14:34:20] heh, sure [14:34:25] kostajh: okay, please ping me once done :) [14:34:58] I don't see the patch in `git log` on mediawiki-staging. Does it take some time to show up? [14:35:00] (03CR) 10JMeybohm: [C: 04-1] Add helm charts and a helmfile configuration for datahub (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:35:04] RECOVERY - Host wcqs2001 is UP: PING OK - Packet loss = 0%, RTA = 32.73 ms [14:35:25] kostajh: you need to do git fetch manually [14:35:28] there's no autopull [14:35:45] urbanecm: I've done that [14:35:55] (03PS2) 10MMandere: site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) [14:36:03] did you rebase too? [14:36:04] kostajh: okay. Does git log -p HEAD..@{u} show your patch? [14:36:08] (and only your patch) [14:36:11] if so, do git rebase [14:36:40] urbanecm: ah, ok. done [14:36:42] thanks. 
[14:36:43] over to you [14:36:46] thanks [14:37:10] (03CR) 10Urbanecm: [C: 03+2] addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [14:37:32] !log depool cp1080 for reimage - T290005 [14:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:38:12] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1085.eqiad.wmnet [14:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:14] (03Merged) 10jenkins-bot: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [14:40:01] RECOVERY - Check to ensure the cfssl signer is working CA: debmonitor #page on pki1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1756 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [14:40:20] (03CR) 10MMandere: [C: 03+2] site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:41:47] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: 9a0aed0: addWiki: Create GrowthExperiment tables for all new Wikipedias (T304052) (duration: 01m 06s) [14:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:53] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [14:41:55] done with deployment for now [14:42:02] (will be back in ~15 mins for the wiki creation window) [14:44:39] (03PS1) 10Jbond: 
P:pki::multirootca::monitoring: triple escape :/ [puppet] - 10https://gerrit.wikimedia.org/r/773243 [14:44:39] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS buster [14:44:40] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1030.eqiad.wmnet with reason: host reimage [14:44:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1030.eqiad.wmnet with reason: host reimage [14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:49] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster [14:45:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1031.eqiad.wmnet with OS bullseye [14:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34522/console" [puppet] - 10https://gerrit.wikimedia.org/r/773243 (owner: 10Jbond) [14:46:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca::monitoring: triple escape :/ [puppet] - 10https://gerrit.wikimedia.org/r/773243 (owner: 10Jbond) [14:46:32] RECOVERY - Juniper virtual chassis ports on asw2-b-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:46:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:46:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:47:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:06] (03CR) 10BBlack: [C: 03+2] map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [14:48:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:08] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [14:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:47] (03PS1) 10BBlack: map Spain to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773244 (https://phabricator.wikimedia.org/T304089) [14:50:49] (03PS1) 10BBlack: map France to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773245 (https://phabricator.wikimedia.org/T304089) [14:51:26] RECOVERY - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1770 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [14:54:04] RECOVERY - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1770 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations [14:54:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 85 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas 
[14:59:19] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] Urbanecm and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for New wiki creation . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1500). [15:00:09] o/ [15:00:12] Amir1: let's start? [15:00:21] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 61 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:00:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1030.eqiad.wmnet with OS bullseye [15:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:35] sure! 
[15:00:43] okay, +2'ing the first one [15:00:49] (03PS3) 10Urbanecm: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) [15:00:55] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) (owner: 10Urbanecm) [15:01:28] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage [15:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:59] (03Merged) 10jenkins-bot: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) (owner: 10Urbanecm) [15:02:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [15:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] pulling to mwmaint [15:03:54] (03PS1) 10Jbond: P:puppetboard: don't monitor testing sites [puppet] - 10https://gerrit.wikimedia.org/r/773248 [15:04:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetboard: don't monitor testing sites [puppet] - 10https://gerrit.wikimedia.org/r/773248 (owner: 10Jbond) [15:04:30] running addwiki [15:05:02] db was created at db1130, which is s5 primary [15:05:06] pulling to mwdebug [15:05:37] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage [15:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:13] wiki works, syncing [15:08:26] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating shnwikivoyage (T302797) (duration: 01m 05s) [15:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:31] 
T302797: Create Wikivoyage Shan - https://phabricator.wikimedia.org/T302797 [15:08:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:09:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:39] (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) [15:09:39] !log urbanecm@deploy1002 Synchronized dblists: Creating shnwikivoyage (T302797) (duration: 01m 05s) [15:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:31] (03PS3) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) [15:11:05] (03PS4) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) [15:12:02] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating shnwikivoyage (T302797) [15:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:10] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating shnwikivoyage (T302797) (duration: 01m 05s) [15:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:46] (03CR) 10Urbanecm: [C: 03+2] Initial 
configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm) [15:14:19] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating shnwikivoyage (T302797) (duration: 01m 05s) [15:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:23] T302797: Create Wikivoyage Shan - https://phabricator.wikimedia.org/T302797 [15:14:31] (03Merged) 10jenkins-bot: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm) [15:14:33] and the last sync... [15:15:03] zabe: ah I see you're doing the exact same thing I am :P [15:15:17] taavi: acquiring low IDs? [15:15:20] yes [15:15:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating shnwikivoyage (T302797) (duration: 01m 05s) [15:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] (03PS1) 10Jbond: P:chartmuseum: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773249 (https://phabricator.wikimedia.org/T304321) [15:15:42] taavi: I'm still mad about mailman [15:15:49] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:15:49] the apache was not properly up [15:15:52] :( [15:16:04] taavi, it's the same game as always :p [15:16:14] in your defense, that's quite an achievement [15:16:17] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) For the Ingress part we will need to use two different names/discovery records for the services (as we can't distinguish by port). Maybe `datahub.disc... 
[15:16:19] no one beated Maintenance script so far :)) [15:16:31] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) a:03JMeybohm [15:16:42] okay, let's see if my change to addWiki.php works [15:17:27] urbanecm: the growth table? I'm not sure if it's deployed yet [15:17:46] (03PS2) 10MMandere: site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005) [15:17:47] i backported it earlier today [15:17:47] I would need to convince someone to +2 a addWiki.php patch if I wanted to beat maintenance script [15:17:48] (03PS2) 10MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) [15:17:50] (03PS1) 10MMandere: site: Reimage cp2033 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) [15:17:52] (03PS1) 10MMandere: site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) [15:17:54] (03PS1) 10MMandere: site: Reimage cp2029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) [15:17:56] (03PS1) 10MMandere: site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) [15:18:20] and the tables are there too [15:18:26] so it worked :)) [15:18:43] and the wiki's up too, so...syncing [15:18:44] (03PS1) 10Jbond: P:debmonitor::server: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773254 (https://phabricator.wikimedia.org/T304321) [15:19:29] It was possible some time ago when maintenance script was broken, e.g. 
shiwiki [15:19:54] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating guwwiki (T303727) (duration: 01m 05s) [15:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:01] T303727: Create Wikipedia Gungbe - https://phabricator.wikimedia.org/T303727 [15:20:10] I think it's actually fairly recent that it's using User:Maintenance_script, previously those edits were attributed to 127.0.0.1 [15:20:15] yup yup [15:20:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] (03PS1) 10JMeybohm: Allow multiple tlsHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966) [15:20:57] (03PS1) 10JMeybohm: Add correct tlsHostnames and extra SAN to datahub cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/773256 (https://phabricator.wikimedia.org/T303049) [15:21:14] !log urbanecm@deploy1002 Synchronized dblists: Creating guwwiki (T303727) (duration: 01m 10s) [15:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:24] (03PS1) 10Jbond: P:docker_registry_ha::registry: Add ssl expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/773257 (https://phabricator.wikimedia.org/T304321) [15:21:27] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) >>! In T303049#7800172, @JMeybohm wrote: > For the Ingress part we will need to use two different names/discovery records for the services (as we can't... 
[15:21:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:34] (03CR) 10Volans: [C: 03+1] "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:22:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:53] hmm, addwiki.php is throwing some PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: the local wiki, Actual: 'guwwiki'. [15:23:03] :( [15:23:08] zabe: can you check if it has a task? [15:23:09] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating guwwiki (T303727) [15:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:40] yes, I can't find one, let me create one [15:23:52] thanks zabe [15:23:58] first time i see scap saying `15:23:16 Huh, lock file disappeared before deletion. This is probably fine-ish :)` [15:24:07] i guess that's because i do a lot of syncs now? [15:24:22] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating guwwiki (T303727) (duration: 01m 06s) [15:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:53] urbanecm: Hmm... 
I'll check the code [15:25:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1031.eqiad.wmnet with OS bullseye [15:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:29] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating guwwiki (T303727) (duration: 01m 05s) [15:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:34] T303727: Create Wikipedia Gungbe - https://phabricator.wikimedia.org/T303727 [15:26:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating guwwiki (T303727) (duration: 01m 07s) [15:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:49] !log urbanecm@deploy1002 Synchronized langlist: Creating guwwiki (T303727) (duration: 01m 04s) [15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:55] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7800187, @BTullis wrote: > It's not going to affect the public-facing (but authenticated) URL of https://datahub.wikimedia.org for the... [15:27:56] okay, per wiki syncs are done now [15:28:03] updating interwiki cache now [15:28:10] created T304528 [15:28:11] T304528: PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: the local wiki, Actual: 'guwwiki'. Pass expected $wikiId. [Called from MediaWiki\Revision\RevisionRecord::getPageId] - https://phabricator.wikimedia.org/T304528 [15:28:18] if only it worked... 
[15:28:45] i can't run scap update-interwiki-cache https://www.irccloud.com/pastebin/SzrfJDJ1/ [15:28:50] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1080.eqiad.wmnet with OS buster [15:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster com... [15:30:47] Amir1: taavi: zabe: any idea wh that's happening? [15:31:05] i see that error thrown at https://gerrit.wikimedia.org/g/mediawiki/core/+/77e159c161a7b83ebe72d4c614674aaf64f7f0fc/includes/interwiki/ClassicInterwikiLookup.php#130, but...wgInterwikiCache should be an array [15:31:42] interwiki cache is not urgent [15:31:50] but yeah, messed up [15:31:58] !log pool cp1080 with HAProxy as TLS termination layer - T290005 [15:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:03] it's not, I'm just wondering what happened with it :) [15:32:03] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:32:09] happy to phabricatorize & leave for later [15:32:11] I think it might be due to configuration handling changes in core [15:32:33] yeah, let's have a phabricator ticket for it [15:32:35] there was some refactoring that happened [15:32:47] I guess mwscript extensions/WikimediaMaintenance/dumpInterwiki.php --wiki=aawiki should work as alternative [15:33:10] hmm, that works [15:33:26] weird [15:34:02] and `/usr/local/bin/mwscript extensions/WikimediaMaintenance/dumpInterwiki.php`, which is what scap update-interwiki-cache runs, works too [15:34:35] * urbanecm is confused [15:35:32] PROBLEM - Some MediaWiki servers are running out of idle 
PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:35:36] urbanecm: Do you have full transcript that includes that warning about the lockfile being missing? [15:35:47] dancy: should be in my scrollback, gimme a sec [15:36:40] dancy: fyi also filled T304529 about scap (the interwiki issue). [15:36:40] T304529: scap update-interwiki-cache throws MWException: Setting $wgInterwikiCache to a CDB path is no longer supported - https://phabricator.wikimedia.org/T304529 [15:37:15] dancy: unfortunately, the lockfile part of the scrollback is gone now. but it was a regular sync, with regular messages, just this one appeared at the top [15:37:21] and it happened for a single deployment only [15:37:29] syncs before and after worked fine [15:38:04] Hmm.. no use of control-c ? [15:38:15] nope [15:38:29] just copy&pasting scap sync-file's to my bash session [15:38:38] alright. 
thanks [15:38:51] !log Created shnwikivoyage and guwwiki [15:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:16] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:39:50] !log foreachwikiindblist wikipedia extensions/WikimediaMaintenance/createExtensionTables.php growthexperiments # T304052 [15:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:57] (03CR) 10STran: Allow autoconfirmed users to view basic IP information (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders) [15:39:58] T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052 [15:40:56] (03CR) 10Urbanecm: Allow autoconfirmed users to view basic IP information (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders) [15:42:29] I'll use the remainder of my window to test the rest of T304052 (now that the tables are at all Wikipedias) [15:46:34] (03CR) 10STran: [C: 03+1] Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders) [15:46:52] Amir1: when accessing guw.wikipedia.org from my staff acc, i get a `2022-03-23 15:45:50 [0d2f325a-dfca-46f8-ae9f-036af9c33950] mw1320 guwwiki 1.39.0-wmf.3 exception ERROR: [0d2f325a-dfca-46f8-ae9f-036af9c33950] / Wikimedia\Rdbms\DBQueryError: Error 1205: Lock wait timeout exceeded; try restarting transaction (db1130)` :( [15:46:59] (03PS1) 10Majavah: admin: add developer-portal namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/773267 
(https://phabricator.wikimedia.org/T297140) [15:47:04] `Query: INSERT IGNORE INTO `user_properties` (up_user,up_property,up_value) VALUES (41,'VectorSkinVersion','1')`, from `MediaWiki\User\UserOptionsManager::saveOptionsInternal` [15:47:17] :/ [15:47:32] let me check [15:47:35] (03PS1) 10Majavah: Add dummy tokens for developer-portal [labs/private] - 10https://gerrit.wikimedia.org/r/773268 (https://phabricator.wikimedia.org/T297140) [15:47:45] funnily enough, user_id=41 matches zero rows [15:47:57] oh I've seen that before, I think the last update on that task was 'it was fixed' [15:48:20] (03PS1) 10Majavah: Add developer-portal k8s accounts [puppet] - 10https://gerrit.wikimedia.org/r/773270 (https://phabricator.wikimedia.org/T297140) [15:48:39] taavi: you've seen that for newly born wikis, or in general? [15:48:54] in general when creating accounts [15:48:59] i see [15:49:03] lemme try to find that task [15:49:18] https://phabricator.wikimedia.org/T294995 [15:49:28] yeah, basically when two users trying to be created at the same time [15:50:18] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bullseye [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:26] so it's still not fixed :/ [15:50:37] looks so :/ [15:50:43] taavi: thanks for the link, left a comment there [15:51:36] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:52:02] although the stack is a different one [15:52:11] (03PS1) 10Majavah: kubeadm::helm: install helmfile [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532) [15:52:20] * urbanecm done with T304052 testing [15:55:53] (03PS1) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 
10https://gerrit.wikimedia.org/r/773272 [15:56:29] (03CR) 10jerkins-bot: [V: 04-1] O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (owner: 10Jbond) [15:56:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm::helm: install helmfile [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah) [15:57:30] (03CR) 10David Caro: [C: 03+2] "Wow!" [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah) [15:58:32] (03PS1) 10Majavah: kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274 [15:58:36] (03PS1) 10Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) [15:59:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34523/console" [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah) [16:00:25] (03PS33) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [16:00:29] (03PS1) 10Majavah: kubeadm::helm: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/773277 [16:00:31] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:00:38] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:01:51] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2033 as cache::text_haproxy [puppet] - 
10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:02:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 72 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:02:09] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:03:13] (03PS2) 10Majavah: kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274 [16:03:15] (03PS2) 10Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) [16:04:23] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [16:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] (03CR) 10David Caro: [C: 03+2] kubeadm::helm: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/773277 (owner: 10Majavah) [16:05:11] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:05:46] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1011 [puppet] - 10https://gerrit.wikimedia.org/r/773278 (https://phabricator.wikimedia.org/T300744) [16:05:55] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10KFrancis) @jbond @TheDJ The agreement has been sent out for signatures. 
[16:06:04] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:06:19] Amir1, btw is there a specific reason why the 'post-creation' tasks are created with a custom edit policy? [16:07:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [16:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:25] zabe: i doubt it. Can you check the code? On phone atm [16:07:27] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1011 [puppet] - 10https://gerrit.wikimedia.org/r/773278 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:08:00] i can take a look [16:08:21] (03PS2) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 [16:09:38] (03Abandoned) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/767530 (owner: 10Ssingh) [16:10:12] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1011.eqiad.wmnet with OS bullseye [16:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:18] (03CR) 10Jbond: "once this is in place we can update the command definitions to use this new check. 
this will prevent us from having to create dedicate mo" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (owner: 10Jbond) [16:12:40] (03CR) 10David Caro: [C: 03+2] kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah) [16:14:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:18:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [16:18:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah) [16:18:39] (03PS3) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) [16:18:41] (03PS1) 10Jbond: icinga: move client_auth_puppet_post to use wmf_check_http [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321) [16:19:48] (03CR) 10David Caro: "Got a question" [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [16:19:58] (KubernetesCalicoDown) firing: kubernetes1011.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:20:58] (03PS1) 10David Caro: systemd:environment: fix typo in docs [puppet] - 10https://gerrit.wikimedia.org/r/773280 [16:21:19] (03PS10) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 
(https://phabricator.wikimedia.org/T265117) [16:25:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage [16:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage [16:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bullseye [16:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:34:54] (03CR) 10David Caro: [C: 03+2] systemd:environment: fix typo in docs [puppet] - 10https://gerrit.wikimedia.org/r/773280 (owner: 10David Caro) [16:39:58] (KubernetesCalicoDown) resolved: kubernetes1011.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:40:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1011.eqiad.wmnet with OS bullseye [16:40:47] jouncebot now [16:40:47] No deployments scheduled for the next 3 hour(s) and 19 minute(s) [16:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:00] I'm going to test /usr/local/bin/mwscript extensions/WikimediaMaintenance/dumpInterwiki.php on deploy1002 
[16:42:40] (03PS1) 10Jbond: external_cloud_endors: ensure we sintall the python3-conftool dependency [puppet] - 10https://gerrit.wikimedia.org/r/773283 [16:43:12] (03CR) 10Jbond: [C: 03+2] external_cloud_endors: ensure we sintall the python3-conftool dependency [puppet] - 10https://gerrit.wikimedia.org/r/773283 (owner: 10Jbond) [16:44:55] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:39] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:49] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:46:51] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Remove v2 config API support [puppet] - 10https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: 10RLazarus) [16:48:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [16:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:29] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34524/console" [puppet] - 10https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: 10RLazarus) [16:51:00] 10SRE: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (10Snwachukwu) [16:51:10] (03PS1) 
10Cwhite: profile: Rsyslog omkafka configs use new ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) [16:53:31] (03CR) 10Majavah: kubeadm::helm: use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [16:54:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:56:34] (03CR) 10David Caro: kubeadm::helm: use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [16:58:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS bullseye [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:43] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1047.eqiad.wmnet with OS bullseye [16:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:04] (03CR) 10Majavah: kubeadm::helm: use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [16:59:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bullseye [16:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:43] jouncebot: refresh [17:01:44] I refreshed my knowledge about deployments. 
[17:01:52] jouncebot: nowandnext [17:01:52] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [17:01:52] In 0 hour(s) and 58 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1800) [17:02:50] * taavi needs to learn that ^W in a browser based client closes the tab instead of deleting that word [17:03:14] taavi: happens to me all the time on irccloud -_- [17:04:53] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:05:27] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:07:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1034.eqiad.wmnet with OS bullseye [17:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:51] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:10:44] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage [17:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage [17:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:19] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [17:14:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:45] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:14:57] 10SRE, 10serviceops: Service puppet certificate due to expire - https://phabricator.wikimedia.org/T304543 (10jbond) p:05Triage→03High [17:17:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [17:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:32] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [17:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:42] (03CR) 10RLazarus: [V: 03+1 C: 03+2] envoy: Remove v2 config API support [puppet] - 10https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: 10RLazarus) [17:25:05] 10SRE, 10serviceops: Service puppet certificate due to expire - https://phabricator.wikimedia.org/T304543 (10jbond) [17:25:35] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10jbond) [17:25:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage [17:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:02] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) [17:26:08] 10SRE, 10Beta-Cluster-Infrastructure, 10envoy, 10serviceops, 10Patch-For-Review: Clean up Puppet support for Envoy v2 config API - 
https://phabricator.wikimedia.org/T303770 (10RLazarus) 05Open→03Resolved [17:26:38] (03CR) 10David Caro: [C: 03+2] kubeadm::helm: use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773274 (owner: 10Majavah) [17:27:07] (03CR) 10David Caro: [C: 03+2] P:environment: enable export_systemd_env in cloud [puppet] - 10https://gerrit.wikimedia.org/r/771576 (owner: 10Jbond) [17:27:17] (03PS4) 10David Caro: P:environment: enable export_systemd_env in cloud [puppet] - 10https://gerrit.wikimedia.org/r/771576 (owner: 10Jbond) [17:27:22] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [17:27:49] (03CR) 10David Caro: [C: 03+2] "Merge whenever you are ready" [puppet] - 10https://gerrit.wikimedia.org/r/771576 (owner: 10Jbond) [17:28:49] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10jbond) >>! In T304237#7797420, @Volans wrote: >>>! In T304237#7797398, @JMeybohm wrote: >>>>! In T304237#7795994, @Volans wrote: >>>... 
[17:31:09] (03PS4) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) [17:32:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1028.eqiad.wmnet with OS bullseye [17:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:19] (03PS6) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) [17:38:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bullseye [17:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:02] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:46] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:47:48] jouncebot now [17:47:48] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [17:48:15] !log trainsperiment (T300203): starting prep for 1.39.0-wmf.4 [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:56] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [17:50:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1034.eqiad.wmnet with OS bullseye [17:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:51:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:24] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773293 [17:51:26] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773293 (owner: 10Brennen Bearnes) [17:52:28] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773293 (owner: 10Brennen Bearnes) [17:52:32] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.4 refs T300203 [17:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:31] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) 05Open→03In progress [17:55:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:55:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34525/console" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [17:59:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for 🚂🧪Trainsperiment Week Deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1800). [18:00:48] In progress! 
[18:01:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34526/console" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [18:02:04] (03CR) 10Elukey: [V: 03+1 C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [18:02:56] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:04:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:18] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:05:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:05:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:39] (03PS3) 10Majavah: kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274 [18:05:41] (03PS3) 10Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) [18:05:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:06:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: 
apply [18:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:52] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 59, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:54] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:05] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773309 (owner: 10RhinosF1) [18:10:21] 10SRE, 10Data-Engineering: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (10Ottomata) [18:14:00] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:12] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 60, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:14] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:17:06] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:25:03] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) PT was pretty smooth, ES likely to be later today, closer to when their daily traffic cycle begins to trend downwards. 
[18:25:11] (03PS1) 10Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T169144) [18:25:30] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:28:12] (03PS3) 10Bking: elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: 10Gehel) [18:28:59] (03PS1) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) [18:29:01] (03PS2) 10Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) [18:29:44] (03CR) 10jerkins-bot: [V: 04-1] Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [18:31:10] (03CR) 10Ebernhardson: [C: 03+1] elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: 10Gehel) [18:31:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:10] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:32:59] (03CR) 10Jcrespo: "Initial patch to 
start a conversation." [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [18:36:33] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1035.eqiad.wmnet with OS bullseye [18:36:35] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1036.eqiad.wmnet with OS bullseye [18:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:37:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:25] (03PS2) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) [18:38:44] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:42:13] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.4 refs T300203 (duration: 49m 41s) [18:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:18] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [18:43:09] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nshahquinn-wmf) p:05Medium→03Low 
Unclear whether or not we want this logic to live in Wmfdata-Python; i... [18:43:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:37] !log brennen@deploy1002 Pruned MediaWiki: 1.38.0-wmf.26 (duration: 02m 05s) [18:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) [18:47:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) [18:47:30] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:47:35] !log trainsperiment (T300203): 1.39.0-wmf.4 on testwikis; proceeding to groups 0-2 with 15 minute intervals for watching logs [18:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:39] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [18:48:07] (03PS3) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) [18:48:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney i have connected spine switches to scs and updated netbox [18:48:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [18:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:04] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) Hmm, the 1.21.1 build didn't work out of the box. Running `build-envoy-deb buster future` got me this: ` [...] ./ci/run_envoy_docker.sh ./ci/do_ci.sh b... [18:50:25] brennen: what's going on with the php-fpm alert above [18:50:41] that's been noisy the last day [18:50:58] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage [18:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [18:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:52:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:07] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4 refs T300203 [18:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:12] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [18:54:32] (03CR) 10Bking: [C: 03+2] elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: 10Gehel) [18:55:01] !log andrew@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage [18:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] (03PS1) 10Arlolra: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) [18:56:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [18:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:59] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773304 [18:57:00] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773304 (owner: 10Brennen Bearnes) [18:57:30] (bit of weirdness trying out new `scap deploy-promote` above; this sync should effectively be a no-op.) [18:57:42] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773304 (owner: 10Brennen Bearnes) [18:57:59] RhinosF1: good question [18:58:53] (03CR) 10RLazarus: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:59:21] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4 refs T300203 [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:00:12] brennen: unfortunately I don't have a good answer to go with it [19:01:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:02:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:11] (03PS1) 10Brennen Bearnes: group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326 [19:04:13] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326 (owner: 10Brennen Bearnes) [19:04:58] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326 (owner: 10Brennen Bearnes) [19:06:24] (03PS1) 10Jbond: P:puppetdb: Add status page functionality to / [puppet] - 10https://gerrit.wikimedia.org/r/773327 [19:07:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34527/console" [puppet] - 10https://gerrit.wikimedia.org/r/773327 (owner: 10Jbond) [19:08:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:14] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.4 refs T300203 [19:08:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb: Add status page functionality to / [puppet] - 10https://gerrit.wikimedia.org/r/773327 (owner: 10Jbond) [19:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:19] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.4 refs T300203 (duration: 00m 52s) [19:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:19] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:09:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:50] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for 
ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [19:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:55] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [19:20:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1035.eqiad.wmnet with OS bullseye [19:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) To clarify the 'port' isn't an option on QFX even for UDP, although it allows you to define a term with that. So I've changed... [19:20:51] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:20:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1036.eqiad.wmnet with OS bullseye [19:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:57] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330 [19:20:59] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330 (owner: 10Brennen Bearnes) [19:21:10] (03PS1) 10Jdlrobson: Enable split A/B testing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) [19:22:05] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330 (owner: 10Brennen Bearnes) [19:23:16] !log 
bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:24] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [19:23:36] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.4 refs T300203 [19:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:45] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:24:18] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:25:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:25:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:04] RECOVERY 
- Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:37:24] !log trainsperiment (T300203): 1.39.0-wmf.4 on all wikis; logs seem clean - end of train deployment activities for the week, unless bugs emerge [19:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:29] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [19:38:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:38:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:20] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:41:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @jclark-ctr super thanks for that! I'll open a task and start planning how we take care of the move. 
[19:44:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [19:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:21] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [19:58:38] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03MoritzMuehlenhoff >>! In T297913#7788208, @MoritzMuehlenhoff wrote: > dumpsdata1007 is now running 5.16.11, can you please retest? > > I'm not familiar with perccli myself, if there... [20:00:05] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T2000). [20:00:05] bd808 and Tran: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:33] o/ [20:01:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr I'm not getting any output on port 20 or 29 of the scs-f8. Are the two Junipers powered on? If not can you double c... 
[20:01:57] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [20:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:02] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [20:03:15] I suppose I could technically do the deployment, but I haven't done scap things in quite some time so I would be more than happy to have RoanKattouw or urbanecm drive if they have time. [20:04:15] Yeah I can drive [20:05:17] (03PS3) 10Catrope: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:05:21] (03CR) 10Catrope: [C: 03+2] wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:05:59] thanks much RoanKattouw [20:06:06] (03Merged) 10jenkins-bot: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:06:32] (03CR) 10Clare Ming: [C: 03+1] Enable split A/B testing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: 10Jdlrobson) [20:07:36] bd808: first patch is ready for testing on mwdebug1002 [20:07:58] When you give me the go-ahead, I'll deploy it and queue up the next one [20:08:33] RoanKattouw: I verified that enwiki and mw.o still load. That's about all that I can test via mwdebug for wikitech things. [20:08:59] Ok [20:09:09] I don't have any fear of us crashing wikitech with these changes. 
Just of borking config in general [20:09:27] (03PS4) 10Catrope: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:09:36] (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:09:50] Makes sense [20:09:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:06] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771443|wikitech: Remove DynamicSidebar (T304006)]] (duration: 00m 52s) [20:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:10] T304006: Undeploy DynamicSidebar extension from Wikimedia wikis (only Wikitech) - https://phabricator.wikimedia.org/T304006 [20:10:22] (03Merged) 10jenkins-bot: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:11:29] bd808: Alright, next one up for testing, I assume it's the same thing of only being able to test that production wikis are still up [20:12:20] RoanKattouw: yes, and the smoke tests look good to me. 
enwiki and mw.o again [20:13:36] !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771444|DynamicSidebar: remove from CommonSettings (T304006)]] (duration: 00m 50s) [20:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bullseye [20:14:27] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bullseye [20:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:31] (03PS3) 10Catrope: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis) [20:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:44] (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis) [20:15:53] (03Merged) 10jenkins-bot: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis) [20:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:14] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [20:18:17] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:18] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [20:22:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:23:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [20:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [20:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:59] bd808: ok next one is up for testing [20:31:11] Sorry for the delay, I had to deal with a rebase conflict [20:32:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bullseye [20:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [20:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:14] RoanKattouw: smoke tests passed. 
ship it :) [20:33:51] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:16] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771447|DynamicSidebar: remove from InitialiseSettings]] (duration: 00m 51s) [20:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:55] (I'm here for my set of patches but need to restart real quick sorry!) [20:35:07] (PS1) SBassett: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773340 (https://phabricator.wikimedia.org/T304111) [20:35:13] (PS3) Catrope: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448 (owner: BryanDavis) [20:35:15] (CR) Catrope: [C: +2] DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448 (owner: BryanDavis) [20:35:29] (PS4) Catrope: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis) [20:35:31] (CR) Catrope: [C: +2] DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis) [20:37:13] (Merged) jenkins-bot: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis) [20:38:22] PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:40:34] PROBLEM - nova-compute proc minimum on 
cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79. Check system logs on 10.64.20.79 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:40:35] !log catrope@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:771448|DynamicSidebar: remove unused extension (T304006)]] (duration: 00m 49s) [20:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:40] T304006: Undeploy DynamicSidebar extension from Wikimedia wikis (only Wikitech) - https://phabricator.wikimedia.org/T304006 [20:41:00] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:41:34] hi - I missed adding a config patch to this deployment window -- I'm happy to do it after the scheduled deployments are done if it's ok. It's config for beta cluster - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/773331 [20:43:26] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [20:44:00] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:44:16] (CR) Reedy: [C: -1] Enable split A/B testing on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson) [20:44:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:36] cjming: go for it [20:45:44] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:45:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:59] could we still do the ip info deploys too? [20:46:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [20:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:11] bd808: ok I think we're done, the docs say to also remove the repo from the make-wmf-branch script, but that script seems to have moved [20:46:27] So not sure what to do there, I'll ask in the releng channel [20:46:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:48] Tran: Yes I'll do yours next [20:46:51] thank you! [20:46:53] RoanKattouw: mediawiki/tools/release [20:47:01] Sorry for the delay, I was trying to find my way through outdated docs [20:47:18] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:47:25] RoanKattouw: ack. I can take care of the make-wmf-branch bits too. Thanks for the deploy work! 
[20:47:36] Reedy: Sure but make-wmf-branch doesn't exist there anymore [20:47:42] And I don't see a list of extensions in that repo [20:48:05] https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-release/settings.yaml [20:48:32] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:48:53] (PS2) Catrope: Allow autoconfirmed users to view basic IP information [mediawiki-config] - https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: Tchanders) [20:48:55] (PS2) Clare Ming: Enable split A/B testing on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson) [20:48:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:09] (CR) Catrope: [C: +2] Allow autoconfirmed users to view basic IP information [mediawiki-config] - https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: Tchanders) [20:49:52] thanks RoanKattouw - I'll wait til you're done - no rush [20:49:53] (Merged) jenkins-bot: Allow autoconfirmed users to view basic IP information [mediawiki-config] - https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: Tchanders) [20:49:53] Reedy: Thanks, I'll update the docs [20:51:26] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:51:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: 
apply [20:51:44] Reedy: should I remove it from make-tarball-release too? I'm not sure what the inclusion criteria is there. [20:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bullseye [20:52:04] bd808: I don't use that script [20:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:22] I suspect it's rotten [20:52:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:52:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:47] RECOVERY - ensure kvm processes are running on cloudvirt1037 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:52:48] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt1037 is OK: OK: synced at Wed 2022-03-23 20:52:46 UTC. https://wikitech.wikimedia.org/wiki/NTP [20:53:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bullseye [20:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:16] Tran: Your change is ready for testing on mwdebug1002, please test [20:53:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:23] RoanKattouw I think I may have messed up the order of operations. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/767216 might have to go in as well so that IPInfo is actually enabled on testwiki. 
[20:54:45] Oh I see [20:54:59] Sorry 🙇‍♂️ [20:55:05] (PS6) Catrope: Enable IPInfo on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: Tchanders) [20:55:09] (CR) Catrope: [C: +2] Enable IPInfo on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: Tchanders) [20:55:55] (Merged) jenkins-bot: Enable IPInfo on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: Tchanders) [20:56:36] Tran: OK try now [20:58:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:09] I can confirm that the extension is installed and the groups have the rights I expect [21:00:33] So good? I think? Something else elsewhere is not what I expect but these patches have done what they should [21:01:06] What is not how you expect and what would it take to fix it? [21:01:42] Hm I thought we enabled IP Info on BetaFeatures earlier but I can't find it in my Special:Preferences [21:01:55] Ideally I would have been able to e2e test this as well by enabling it and confirming I could use the feature [21:02:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:15] Oh you might have to add it to the list of BetaFeatures [21:04:47] See the wgBetaFeaturesWhitelist (sic) setting [21:05:31] ( Tran ) [21:05:59] oh nooooo I remember now. I think we were still doing that. 
Okay yes the patches do what I expect and unfortunately, iirc now, we have not yet finished adding IPInfo as a beta feature [21:06:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:34] noting that i might be rolling the train back for T304564 [21:06:34] T304564: MWException: `[title]` is not a valid file title. - https://phabricator.wikimedia.org/T304564 [21:06:43] (after deploy window is clear) [21:08:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bullseye [21:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:07] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:10:43] Actually, RoanKattouw I think we have enabled it? `wmgUseIPInfo` has the `'testwiki' => true, // T260598` key [21:10:45] T260598: Deploy IP Info extension to test.wikipedia.org - https://phabricator.wikimedia.org/T260598 [21:11:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:34] Tran: Right, so IPInfo will be enabled on testwiki once I deploy this, but the BetaFeature will not be [21:11:42] Is that right? Should I pull the trigger and deploy? [21:12:50] I don't think it hurts to deploy? It does as expected and it doesn't break anything. It's just that w/o it being a BetaFeature, it won't be visible to users. 
[21:13:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:13:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:02] OK, then I'll deploy now [21:14:10] thank you! [21:15:30] !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:772408|Allow autoconfirmed users to view basic IP information (T303858)]] and [[gerrit:767216|Enable IPInfo on testwiki (T260598)]] (duration: 00m 50s) [21:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:37] T303858: Make IP Info available to all users in the 'autoconfirmed' group on testwiki - https://phabricator.wikimedia.org/T303858 [21:15:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:36] Alright we're all done [21:18:03] cjming: Feel free to +2 your labs-only change now, and after that brennen can take over and do the train rollback [21:18:13] thanks again! [21:18:13] will do - thanks Roan! [21:18:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bullseye [21:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:46] (CR) Clare Ming: [C: +2] Enable split A/B testing on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson) [21:19:12] i may hold rollback unless it recurs (last was 20:55 UTC), but thanks for ping. 
[21:19:47] (Merged) jenkins-bot: Enable split A/B testing on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson) [21:24:00] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:773331|Enable split A/B testing on beta cluster (T301584)]] (duration: 00m 50s) [21:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:06] T301584: Add rich snippet instrument to WikimediaEvents - https://phabricator.wikimedia.org/T301584 [21:24:40] alrighty I'm done too [21:25:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:26:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:46] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [21:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:17] (PS1) STran: Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) [21:35:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [21:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:07] !log 
bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956 [21:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:12] T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956 [21:48:30] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bullseye [21:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bullseye [22:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bullseye [22:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:55] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:15:22] (CR) Krinkle: [C: +1] "Good to go. Thank you Zabe!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [22:18:48] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [22:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:13] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage [22:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:48] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [22:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage [22:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:49] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:26:51] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:27:53] (PS1) Brennen Bearnes: Revert "Handle broken media and thumb error in the same case for gallery" [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773314 [22:28:55] (CR) Jdlrobson: Enable split A/B testing on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson) [22:33:34] (Abandoned) Brennen Bearnes: Revert "Handle broken media and thumb error in the same case for gallery" [core] 
(wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773314 (owner: Brennen Bearnes) [22:36:01] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye [22:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:04] (PS1) Brennen Bearnes: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) [22:47:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bullseye [22:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bullseye [22:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:16] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [22:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:30] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [22:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:58] (PS2) Krinkle: parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes) [22:56:02] (CR) Krinkle: [C: +1] parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) 
(owner: Brennen Bearnes) [23:16:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bullseye [23:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:18] !log remove openjdk-8-jre from codfw logstash nodes T301770 [23:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:23] T301770: Remove obsolete Java 8 packages from logstash cluster - https://phabricator.wikimedia.org/T301770 [23:34:03] !log trainsperiment (T300203): reverting to 1.39.0-wmf.3 on all wikis for T304564; will move forward again after a fix. [23:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:09] T304564: MWException: `[title]` is not a valid file title. - https://phabricator.wikimedia.org/T304564 [23:34:09] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [23:35:07] (PS1) Brennen Bearnes: all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363 [23:35:09] (CR) Brennen Bearnes: [C: +2] all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363 (owner: Brennen Bearnes) [23:36:26] (Merged) jenkins-bot: all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363 (owner: Brennen Bearnes) [23:38:02] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.3 refs T300203 [23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:11] (PS1) RLazarus: envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) [23:40:36] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to 
supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) [23:40:46] SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Stalled→In progress p:Low→Medium [23:43:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:45:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:38] (CR) Brennen Bearnes: [C: -2] parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes) [23:46:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:32] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bullseye [23:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bullseye [23:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bullseye [23:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:56:08] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:59:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log