[00:11:18] <cjming>	 !log end running skin preference update script T299104
[00:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:23] <stashbot>	 T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104
[00:25:45] <wikibugs>	 (PS1) Andrew Bogott: cloudvirt1024: update nic ids and set legacy_vlan_naming: false [puppet] - https://gerrit.wikimedia.org/r/772947
[00:27:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudvirt1024: update nic ids and set legacy_vlan_naming: false [puppet] - https://gerrit.wikimedia.org/r/772947 (owner: Andrew Bogott)
[00:39:47] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1024 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:32:59] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:04] <jouncebot>	 Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0200)
[02:07:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965
[02:07:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965 (owner: TrainBranchBot)
[02:07:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:08:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.4 [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/772965 (owner: TrainBranchBot)
[02:29:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:29:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:35] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:47] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:29] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[03:05:37] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:46:15] <wikibugs>	 (CR) Cwhite: [C: +1] role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[04:41:07] <wikibugs>	 SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (odimitrijevic)
[05:02:29] <wikibugs>	 SRE, GitLab, Horizon, wikitech.wikimedia.org, Security: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (hashar)
[05:08:11] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:11] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:47:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:53:34] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) Thank you Chris, the RAID is back to optimal
[05:57:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:02:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 with low weight T301879', diff saved to https://phabricator.wikimedia.org/P22995 and previous config saved to /var/cache/conftool/dbconfig/20220323-060351-marostegui.json
[06:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:58] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:05:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for reimage', diff saved to https://phabricator.wikimedia.org/P22996 and previous config saved to /var/cache/conftool/dbconfig/20220323-060533-marostegui.json
[06:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:18] <wikibugs>	 (PS1) Marostegui: db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773120 (https://phabricator.wikimedia.org/T300600)
[06:07:31] <wikibugs>	 (CR) Marostegui: [C: +2] db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773120 (https://phabricator.wikimedia.org/T300600) (owner: Marostegui)
[06:09:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1112.eqiad.wmnet with OS bullseye
[06:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1112.eqiad.wmnet with reason: host reimage
[06:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:42] <wikibugs>	 (CR) Marostegui: [C: +1] "This looks good to me" [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:20:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1112.eqiad.wmnet with reason: host reimage
[06:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:15] <wikibugs>	 (CR) Marostegui: [C: +1] wmcs: stop accessing gu_hidden in maintain-views [puppet] - https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) (owner: Zabe)
[06:24:40] <wikibugs>	 (CR) Marostegui: [C: +2] wmcs: stop accessing gu_hidden in maintain-views [puppet] - https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) (owner: Zabe)
[06:24:51] <wikibugs>	 (CR) Marostegui: [C: +2] wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:25:00] <wikibugs>	 (PS2) Marostegui: wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:26:27] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) (owner: Zabe)
[06:34:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1112.eqiad.wmnet with OS bullseye
[06:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:49] <wikibugs>	 (PS1) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:37:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: Marostegui)
[06:37:21] <wikibugs>	 (PS2) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:37:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: Marostegui)
[06:38:41] <wikibugs>	 (PS3) Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[06:41:36] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:42:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:42:45] <wikibugs>	 (PS1) Marostegui: Revert "db1112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772895
[06:43:28] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772895 (owner: Marostegui)
[06:43:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P22997 and previous config saved to /var/cache/conftool/dbconfig/20220323-064353-root.json
[06:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:49:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:14] <wikibugs>	 ops-eqiad, serviceops: mc1053 PS redundancy alert - https://phabricator.wikimedia.org/T304477 (elukey)
[06:58:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P22998 and previous config saved to /var/cache/conftool/dbconfig/20220323-065856-root.json
[06:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:25] <urbanecm>	 indeed, nothing to do
[07:01:22] <wikibugs>	 (PS4) Elukey: Initial debianization of istio-cni [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612)
[07:02:47] <wikibugs>	 (CR) Elukey: Initial debianization of istio-cni (3 comments) [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[07:14:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P22999 and previous config saved to /var/cache/conftool/dbconfig/20220323-071400-root.json
[07:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P23000 and previous config saved to /var/cache/conftool/dbconfig/20220323-072904-root.json
[07:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:26] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! We can deploy anytime" [deployment-charts] - https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270) (owner: AikoChou)
[07:44:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P23001 and previous config saved to /var/cache/conftool/dbconfig/20220323-074408-root.json
[07:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:52] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1009 [puppet] - https://gerrit.wikimedia.org/r/773181 (https://phabricator.wikimedia.org/T300744)
[07:48:53] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1009 [puppet] - https://gerrit.wikimedia.org/r/773181 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:54:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1009.eqiad.wmnet with OS bullseye
[07:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: Time to snap out of that daydream and deploy 🚂🧪Trainsperiment Week Deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0800).
[08:00:05] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: (Dis)respected human, time to deploy 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T0800). Please do the needful.
[08:00:13] <wikibugs>	 (PS14) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:03:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:04:04] <wikibugs>	 SRE, SRE Observability: thanos: 404 error trying to fetch js library - https://phabricator.wikimedia.org/T269000 (fgiunchedi) Open→Declined Declining because this is indeed harmless and we're not looking at having sourcemaps for thanos
[08:05:01] <wikibugs>	 (PS15) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:06:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34503/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:09:11] <wikibugs>	 SRE, observability, SRE Observability (FY2021/2022-Q3), Sustainability (Incident Followup), User-fgiunchedi: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (fgiunchedi)
[08:09:23] <wikibugs>	 SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), and 2 others: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi)
[08:10:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage
[08:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:20] <wikibugs>	 (PS16) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:12:51] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: host reimage
[08:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:03] <wikibugs>	 (PS17) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:16:42] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34505/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:19:24] <wikibugs>	 (PS18) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:20:28] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34506/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:21:40] <wikibugs>	 (Abandoned) Razzi: kafka-main: add kafka-main200[45] to the codfw cluster [puppet] - https://gerrit.wikimedia.org/r/520465 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[08:22:08] <wikibugs>	 (PS19) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:23:17] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34507/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:23:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:24:43] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1009.eqiad.wmnet with OS bullseye
[08:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:06] <wikibugs>	 (PS8) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[08:27:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:27:53] <wikibugs>	 (PS20) Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909
[08:28:55] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34508/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (owner: Elukey)
[08:29:58] <wikibugs>	 (PS2) MMandere: site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005)
[08:31:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ship it :-)" [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:36:33] <mmandere>	 !log depool cp1079 for reimage - T290005
[08:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:38] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[08:40:18] <wikibugs>	 (PS1) Hashar: mediawiki::php::monitoring: dupe def PHP_VERSION [puppet] - https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945)
[08:40:39] <wikibugs>	 (PS21) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612)
[08:40:41] <wikibugs>	 (PS1) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:41:10] <wikibugs>	 (CR) Hashar: "That should remove 70k/h php notices from logstash ;)" [puppet] - https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945) (owner: Hashar)
[08:41:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:42:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34509/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:43:08] <wikibugs>	 (PS9) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[08:43:48] <moritzm>	 !log installing openssl security updates
[08:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:20] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:02] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:28] <wikibugs>	 (PS2) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:46:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:46:34] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[08:47:44] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS buster
[08:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:52] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[08:48:06] <wikibugs>	 (PS3) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185
[08:49:09] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34512/console" [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey)
[08:50:44] <wikibugs>	 (CR) Elukey: [V: +1] "This is an example of how the istio-cni config could be easily chained:" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[08:51:29] <wikibugs>	 Puppet, Infrastructure-Foundations: Search broken on puppetboard - https://phabricator.wikimedia.org/T304484 (ayounsi) p:Triage→Low
[08:51:33] <logmsgbot>	 !log mmandere@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1079.eqiad.wmnet with OS buster
[08:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:41] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster exe...
[08:51:49] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[08:54:04] <moritzm>	 !log restarting spamassassin/clamav on otrs1001/ticket.wikimedia.org
[08:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:41] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS buster
[08:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:49] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[08:57:11] <wikibugs>	 (CR) Elukey: [C: +1] "I am not super familiar with the scaffold/etc.. configs but LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[08:59:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] <wikibugs>	 (PS1) MMandere: site: Reimage cp1081 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005)
[09:03:34] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:04:33] <wikibugs>	 (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[09:06:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:49] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp1081 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[09:09:42] <wikibugs>	 (CR) Elukey: [C: +1] Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:10:44] <wikibugs>	 (CR) Elukey: "Looks good, I'd also ask a quick review to Traffic for confirmation/awareness of the change." [puppet] - https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[09:11:18] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[09:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:12] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[09:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:09] <icinga-wm>	 PROBLEM - Check systemd state on db1169 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:50] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991)
[09:21:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:23:01] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991)
[09:23:23] <wikibugs>	 (PS1) JMeybohm: Update miscweb to latest scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966)
[09:24:52] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:38] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1079.eqiad.wmnet with OS buster
[09:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:47] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster com...
[09:39:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: 10Filippo Giunchedi)
[09:39:54] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Test mediabackups updates on testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764)
[09:40:16] <wikibugs>	 (03PS2) 10Jcrespo: mediabackups: Test mediabackups updates on testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764)
[09:41:00] <wikibugs>	 10ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (10ayounsi) p:05Triage→03High
[09:42:45] <wikibugs>	 10ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (10ayounsi)
[09:43:06] <mmandere>	 !log pool cp1079 with HAProxy as TLS termination layer - T290005
[09:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:12] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:46:10] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw2-b-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 ayounsi https://phabricator.wikimedia.org/T304488 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[09:46:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php::monitoring: dupe def PHP_VERSION [puppet] - 10https://gerrit.wikimedia.org/r/773184 (https://phabricator.wikimedia.org/T301945) (owner: 10Hashar)
[09:47:11] <hashar>	 _joe_: to be fair I have no idea why we had `define( 'PHP_VERSION', php_version() );`   maybe it had a specific purpose :\
[09:47:48] <_joe_>	 hashar: it didn't, it was part of a huge patch series to introduce multiple php engines at the same time, it slipped
[09:47:50] <hashar>	 once puppet ran on the host we should see a drop at https://logstash.wikimedia.org/goto/5967d326a61573afd237736c95d08a01
[09:47:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx/htmldumps [puppet] - 10https://gerrit.wikimedia.org/r/772335 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:47:57] <hashar>	 ah good
[09:48:02] <_joe_>	 I was writing 5 languages in the same patches, that kind of stuff
[09:48:14] <hashar>	 I noticed it when opening logstash, which shows the unfiltered events, and that one stood out this morning :]
[09:48:19] <hashar>	 ahah
[09:48:25] <hashar>	 too many languages
[09:48:49] <_joe_>	 yeah you have puppet, ruby for the templates, bash, php, and some go-langish dsl for mtail, and ofc python 
[09:50:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm)
[09:51:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update miscweb to latest scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm)
[09:54:49] <wikibugs>	 (03Merged) 10jenkins-bot: Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm)
[09:56:27] <wikibugs>	 (03PS2) 10JMeybohm: Update miscweb to latest scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/773191 (https://phabricator.wikimedia.org/T290966)
[09:56:35] <mmandere>	 !log depool cp1081 for reimage - T290005
[09:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:43] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:01:08] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp1081 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773188 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[10:04:09] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:07:55] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1081.eqiad.wmnet with OS buster
[10:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:05] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster
[10:08:42] <wikibugs>	 (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1010 [puppet] - 10https://gerrit.wikimedia.org/r/773193 (https://phabricator.wikimedia.org/T300744)
[10:18:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 some more weight T301879', diff saved to https://phabricator.wikimedia.org/P23002 and previous config saved to /var/cache/conftool/dbconfig/20220323-101816-marostegui.json
[10:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:22] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[10:21:56] <wikibugs>	 (03PS31) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[10:22:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Test mediabackups updates on testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/773192 (https://phabricator.wikimedia.org/T299764) (owner: 10Jcrespo)
[10:22:59] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis)
[10:23:54] <jynus>	 btullis: merging?
[10:24:07] <btullis>	 Yes, was about to ask you.
[10:24:25] <jynus>	 mine is ok, if it is a one line change saying wiki:testwiki
[10:24:25] <btullis>	 Happy for me to merge a89e890325 for you?
[10:24:40] <btullis>	 Done, thanks.
[10:24:49] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage
[10:24:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:30] <wikibugs>	 (03CR) 10Ayounsi: "This is not WMF specific so in theory should go in the main homer branch. But realistically it doesn't matter too much :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[10:28:24] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage
[10:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:09] <moritzm>	 !log restarting ntpd
[10:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:02] <wikibugs>	 (03PS1) 10ArielGlenn: include the dumps admins in the dumpsdata role [puppet] - 10https://gerrit.wikimedia.org/r/773195
[10:36:51] <wikibugs>	 (03Restored) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo)
[10:37:07] <wikibugs>	 (03PS2) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344
[10:37:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo)
[10:37:49] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005)
[10:37:51] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005)
[10:37:53] <icinga-wm>	 RECOVERY - Check systemd state on db1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:53] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005)
[10:37:55] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005)
[10:38:20] <wikibugs>	 (03PS1) 10Jbond: P:mediawiki: add autorestart to httpd and php [puppet] - 10https://gerrit.wikimedia.org/r/773200
[10:39:32] <wikibugs>	 (03PS3) 10Jcrespo: test [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344
[10:40:16] <wikibugs>	 (03PS32) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[10:40:50] <wikibugs>	 (03Abandoned) 10Jbond: P:mediawiki: add autorestart to httpd and php [puppet] - 10https://gerrit.wikimedia.org/r/773200 (owner: 10Jbond)
[10:46:20] <wikibugs>	 (03PS1) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202
[10:51:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[10:52:02] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1081.eqiad.wmnet with OS buster
[10:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:11] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster com...
[10:52:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond)
[10:52:38] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[10:53:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[10:53:35] <wikibugs>	 (03PS4) 10Jcrespo: Add unit testing directory so that CI succeeds [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344
[10:53:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[10:55:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "This is ready to go, but see my comment on ticket to see if you want to add more directories now." [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[10:56:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Add unit testing directory so that CI succeeds [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772344 (owner: 10Jcrespo)
[10:57:39] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:58:15] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) 05Open→03Stalled
[10:58:29] <moritzm>	 !log restarting apache on matomo1002/piwik.wikimedia.org
[10:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:38] <wikibugs>	 (03CR) 10Marostegui: "I have cleaned up my cumin2002 directory" [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:00:07] <mmandere>	 !log pool cp1081 with HAProxy as TLS termination layer - T290005
[11:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:13] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:01:15] <wikibugs>	 (03PS2) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202
[11:01:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] cluster::management: backup also /home (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:02:35] <wikibugs>	 (03PS3) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202
[11:02:37] <wikibugs>	 (03CR) 10Volans: cluster::management: backup also /home (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:03:10] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:04:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "The 6.1G were from the reimage and are now cleaned out, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:05:52] <wikibugs>	 (03PS4) 10Volans: cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202
[11:07:36] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Search broken on puppetboard - https://phabricator.wikimedia.org/T304484 (10Volans) Yes that seems a typo upstream, IIRC I reported that to John a while ago, not sure if it was fixed upstream by now.
[11:15:16] <wikibugs>	 (03PS4) 10Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991)
[11:15:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:17:12] <wikibugs>	 (03PS1) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240)
[11:19:41] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable profile::auto_restarts::service for klaxon gunicorn webapp [puppet] - 10https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991)
[11:24:31] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[11:25:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for imagecatalog [puppet] - 10https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991)
[11:33:35] <moritzm>	 !log installing apache security updates on stretch
[11:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:16] <jbond>	 !log upload new puppetboard_3.1.0-1+deb11u1_all.deb
[11:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:22] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cluster::management: backup also /home [puppet] - 10https://gerrit.wikimedia.org/r/773202 (owner: 10Volans)
[11:44:08] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Search broken on puppetboard - https://phabricator.wikimedia.org/T304484 (10jbond) 05Open→03Resolved a:03jbond I have deployed an update which has fixed this, please reopen if i missed something
[11:46:14] <wikibugs>	 10SRE, 10Thumbor, 10serviceops, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10jbond) p:05Triage→03Medium
[11:46:41] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops: mc1053 PS redundancy alert - https://phabricator.wikimedia.org/T304477 (10jbond) p:05Triage→03Medium
[11:47:41] <wikibugs>	 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10TomekSikora.Monsoon)
[11:49:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond)
[11:57:34] <wikibugs>	 (03PS1) 10Jbond: admin: add sgimeno user [puppet] - 10https://gerrit.wikimedia.org/r/773207 (https://phabricator.wikimedia.org/T304361)
[11:58:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add sgimeno user [puppet] - 10https://gerrit.wikimedia.org/r/773207 (https://phabricator.wikimedia.org/T304361) (owner: 10Jbond)
[11:59:57] <wikibugs>	 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10Aklapper) 05Open→03Stalled @TomekSikora.Monsoon: Hi. If this is a serious request and not a test, then please edit the task title (which RESOURCE?), and fill in ALL fields in the description.
[12:04:40] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) 05Open→03Resolved a:03jbond @Sgs access has now been set up; you should have received an email indicating how to configure kerberos, please re-open if you are s...
[12:07:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 after testing', diff saved to https://phabricator.wikimedia.org/P23003 and previous config saved to /var/cache/conftool/dbconfig/20220323-120749-marostegui.json
[12:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:39] <wikibugs>	 (03CR) 10Ladsgroup: "I understand you have a large backlog but this is three weeks now." [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup)
[12:16:40] <wikibugs>	 (03CR) 10Jakob: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob)
[12:27:55] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] drop_gu_hidden_T302658.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui)
[12:29:16] <moritzm>	 !log restarting Turnilo for OpenSSL update
[12:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:31] <wikibugs>	 (03PS1) 10Sbisson: Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004)
[12:31:47] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:06] <wikibugs>	 (03PS1) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321)
[12:35:08] <wikibugs>	 (03PS4) 10Marostegui: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658)
[12:35:12] <wikibugs>	 (03PS1) 10Jbond: C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321)
[12:36:33] <wikibugs>	 (03CR) 10Volans: "I haven't had a chance yet to give the code a pass, but I've left a comment on the packaging." [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[12:38:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34514/console" [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[12:40:18] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui)
[12:40:30] <wikibugs>	 (03PS2) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321)
[12:43:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34515/console" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[12:47:30] <wikibugs>	 (03PS2) 10Jbond: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: 10Filippo Giunchedi)
[12:47:32] <wikibugs>	 (03PS1) 10Jbond: C:nagios_common: add new check for check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773215 (https://phabricator.wikimedia.org/T304321)
[12:51:47] <wikibugs>	 (03CR) 10Tchanders: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[12:52:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34518/console" [puppet] - 10https://gerrit.wikimedia.org/r/773215 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[12:52:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:nagios_common: add new check for check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773215 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[12:53:05] <wikibugs>	 (03PS3) 10Jbond: C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321)
[12:53:53] <wikibugs>	 (03PS2) 10Jbond: C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321)
[12:55:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34519/console" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[12:56:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P23004 and previous config saved to /var/cache/conftool/dbconfig/20220323-125625-marostegui.json
[12:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:30] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[12:58:22] <moritzm>	 !log installing bind security updates
[12:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:11] <wikibugs>	 (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:44] <Lucas_WMDE>	 uh
[13:00:56] <Lucas_WMDE>	 did I add my change to the wrong window?
[13:01:05] <Lucas_WMDE>	 damn, I added it to tomorrow's window
[13:01:52] <Lucas_WMDE>	 should be better now
[13:02:11] <Lucas_WMDE>	 but anyways – I’m still eating lunch, so if there are no other changes in the window, I’ll be back in half an hour or so :)
[13:02:43] <wikibugs>	 (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321)
[13:02:45] <wikibugs>	 (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321)
[13:05:59] <wikibugs>	 (03PS1) 10Jbond: C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321)
[13:07:48] <mmandere>	 !log depool cp1082 for reimage - T290005
[13:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:54] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[13:09:08] <wikibugs>	 (03PS1) 10Jbond: C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321)
[13:11:11] <wikibugs>	 (03PS3) 10Filippo Giunchedi: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321)
[13:11:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23005 and previous config saved to /var/cache/conftool/dbconfig/20220323-131130-marostegui.json
[13:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:37] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp1082 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773196 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[13:13:42] <wikibugs>	 (03PS1) 10Jbond: C:lvs::monitor_services: Add ssl expiry checks for lvs [puppet] - 10https://gerrit.wikimedia.org/r/773221 (https://phabricator.wikimedia.org/T304321)
[13:14:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) (owner: 10Filippo Giunchedi)
[13:14:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1010 [puppet] - 10https://gerrit.wikimedia.org/r/773193 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[13:16:33] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1082.eqiad.wmnet with OS buster
[13:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:42] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster
[13:17:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui)
[13:17:30] <wikibugs>	 (03Merged) 10jenkins-bot: drop_gu_hidden_T302658.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/773134 (https://phabricator.wikimedia.org/T302658) (owner: 10Marostegui)
[13:18:09] <wikibugs>	 (03PS1) 10Jbond: C:noc: Add ssl expiry checks for noc [puppet] - 10https://gerrit.wikimedia.org/r/773223 (https://phabricator.wikimedia.org/T304321)
[13:19:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1010.eqiad.wmnet with OS bullseye
[13:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:14] <icinga-wm>	 PROBLEM - puppetboard-samltest.wikimedia.org requires authentication on puppetboard2002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-samltest.wikimedia.org:443/ - 582 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:20:27] <icinga-wm>	 PROBLEM - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[13:20:35] <volans>	 godog, jbond ^^^
[13:20:46] <jbond>	 looking
[13:20:47] <godog>	 thanks volans, indeed
[13:20:48] <icinga-wm>	 PROBLEM - puppetboard-idptest.wikimedia.org requires authentication on puppetboard1002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-idptest.wikimedia.org:443/ - 580 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:20:49] <godog>	 acking
[13:21:20] <godog>	 so nothing is broken per-se, or at least not more broken than five minutes ago
[13:21:36] <jbond>	 thanks godog, I'll fix this, looks like the url check never worked
[13:21:48] <jbond>	 well, it is using the wrong url
[13:22:18] <godog>	 jbond: yeah, the pki alert or puppetboard-saml or both ?
[13:22:52] <icinga-wm>	 PROBLEM - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[13:23:05] <godog>	 ditto ^
[13:23:09] * volans acked on VO
[13:23:11] * Emperor here
[13:23:25] <Emperor>	 ...just a bit too late, as ever :-/
[13:23:26] * jhathaway here as well
[13:23:40] <jbond>	 nothing to see here sorry for the noise
[13:24:02] <jhathaway>	 np
[13:24:12] <godog>	 indeed sorry for the mispages, all for the better though at least
[13:24:18] <volans>	 may I suggest adding some downtime to the modified checks, so we can spot the failing ones on icinga without paging?
[13:25:00] <icinga-wm>	 PROBLEM - Check to ensure the cfssl signer is working CA: debmonitor #page on pki1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 400 Bad Request - string success:true not found on https://pki.discovery.wmnet:443/api/v1/cfssl/info - 446 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[13:25:02] <wikibugs>	 (03PS1) 10Jbond: C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - 10https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321)
[13:25:18] <godog>	 volans: yes will do
[13:26:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23008 and previous config saved to /var/cache/conftool/dbconfig/20220323-132635-marostegui.json
[13:26:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:39] <Amir1>	 Funny, I didn't get paged with the VO app, just the text. But I got a push notification when it got resolved 🤦🤦🤦
[13:27:12] <godog>	 {{done}} downtimed the cfssl p a g e alerts
[13:28:25] <Lucas_WMDE>	 alright, I’m back
[13:28:41] <Lucas_WMDE>	 can I proceed with the backport+config window or is something going on?
[13:28:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:29:36] <godog>	 Lucas_WMDE: good to go I think
[13:29:41] <Lucas_WMDE>	 great, thanks
[13:33:14] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090
[13:33:25] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1082.eqiad.wmnet with reason: host reimage
[13:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage
[13:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1027.eqiad.wmnet with OS bullseye
[13:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:16] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye
[13:35:42] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 (owner: 10Lucas Werkmeister (WMDE))
[13:36:28] <wikibugs>	 (03Merged) 10jenkins-bot: Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768090 (owner: 10Lucas Werkmeister (WMDE))
[13:36:57] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1082.eqiad.wmnet with reason: host reimage
[13:37:00] <Lucas_WMDE>	 testing on mwdebug1001
[13:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:13] <icinga-wm>	 PROBLEM - puppetboard-idptest.wikimedia.org requires authentication on puppetboard2002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-idptest.wikimedia.org:443/ - 580 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:38:04] <Lucas_WMDE>	 seems to be working fine, syncing
[13:38:57] <moritzm>	 !log restarting superset for OpenSSL update
[13:38:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:27] <elukey>	 kubernetes1010 is me, reimaging
[13:39:31] <wikibugs>	 (03PS1) 10Jbond: P:pki: fix nagios checks for PKI [puppet] - 10https://gerrit.wikimedia.org/r/773227
[13:39:44] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob)
[13:39:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: host reimage
[13:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:768090|Write "unexpectedUnconnectedPage" page prop on Test Wikidata clients]] (duration: 01m 10s)
[13:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:26] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240)
[13:41:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P23009 and previous config saved to /var/cache/conftool/dbconfig/20220323-134140-marostegui.json
[13:41:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[13:41:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[13:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:41:45] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[13:41:47] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob)
[13:41:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34520/console" [puppet] - 10https://gerrit.wikimedia.org/r/773227 (owner: 10Jbond)
[13:41:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T300775)', diff saved to https://phabricator.wikimedia.org/P23010 and previous config saved to /var/cache/conftool/dbconfig/20220323-134153-marostegui.json
[13:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:26] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikibase REST API on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773209 (https://phabricator.wikimedia.org/T302959) (owner: 10Jakob)
[13:42:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: fix nagios checks for PKI [puppet] - 10https://gerrit.wikimedia.org/r/773227 (owner: 10Jbond)
[13:43:11] <Lucas_WMDE>	 checking that the beta change does nothing on mwdebug1001…
[13:43:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:04] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:31] <wikibugs>	 (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[13:44:42] <Lucas_WMDE>	 looks good I think, I’ll sync it
[13:45:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1029.eqiad.wmnet with OS bullseye
[13:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:46:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage
[13:46:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:773209|Enable Wikibase REST API on beta wikidata (T302959)]] (1/2, production no-op) (duration: 01m 07s)
[13:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:30] <stashbot>	 T302959: Create a test/validation system for the Wikibase REST API - https://phabricator.wikimedia.org/T302959
[13:47:06] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/hyperkitty/list/wikimedia-l@lists.wikimedia.org/ - 47822 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:47:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:773209|Enable Wikibase REST API on beta wikidata (T302959)]] (2/2, production no-op) (duration: 01m 05s)
[13:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:48:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:26] <Lucas_WMDE>	 !log UTC afternoon backport window done
[13:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/postorius/lists/wikimedia-l.lists.wikimedia.org/ - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:49:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:43] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:50:53] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage
[13:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:48] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1010.eqiad.wmnet with OS bullseye
[13:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:01] <wikibugs>	 (03PS1) 10Jbond: PKI: double escape, one for puppet one for icinga [puppet] - 10https://gerrit.wikimedia.org/r/773231
[13:54:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1029.eqiad.wmnet with reason: host reimage
[13:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:56] <wikibugs>	 (03PS1) 10Urbanecm: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052)
[13:56:04] <urbanecm>	 jouncebot: nowandnext
[13:56:05] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1300)
[13:56:05] <jouncebot>	 In 1 hour(s) and 3 minute(s): New wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1500)
[13:56:21] <wikibugs>	 (03PS1) 10Urbanecm: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052)
[13:56:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] PKI: double escape, one for puppet one for icinga [puppet] - 10https://gerrit.wikimedia.org/r/773231 (owner: 10Jbond)
[13:56:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "deploying" [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm)
[13:56:47] <icinga-wm>	 PROBLEM - puppetboard-samltest.wikimedia.org requires authentication on puppetboard1002 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-samltest.wikimedia.org:443/ - 582 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:57:04] <urbanecm>	 since wmf.4's not at deploy1002 yet, just +2'ed to ensure it will ride with wmf.4
[13:57:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[13:57:25] <urbanecm>	 will do wmf.3 soon, so i can ensure the change works in the new wiki creation window in an hour
[13:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Hue [puppet] - 10https://gerrit.wikimedia.org/r/773232 (https://phabricator.wikimedia.org/T135991)
[13:57:50] <wikibugs>	 (03PS1) 10Majavah: openstack::nova::fullstack: restart service on setting changes [puppet] - 10https://gerrit.wikimedia.org/r/773233
[13:57:52] <wikibugs>	 (03PS1) 10Majavah: openstack::nova::fullstack: use bullseye image [puppet] - 10https://gerrit.wikimedia.org/r/773234
[13:58:40] <wikibugs>	 (03Merged) 10jenkins-bot: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/772901 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm)
[13:58:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:58:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[13:59:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - 10https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:59:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:36] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1029.eqiad.wmnet with reason: host reimage
[13:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:48] <wikibugs>	 (03PS2) 10Urbanecm: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797)
[14:00:22] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[14:00:25] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1082.eqiad.wmnet with OS buster
[14:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster com...
[14:00:44] <wikibugs>	 (03PS2) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727)
[14:02:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm)
[14:04:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[14:04:19] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[14:04:32] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[14:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:02] <mmandere>	 !log pool cp1082 with HAProxy as TLS termination layer - T290005
[14:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:06] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[14:08:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:08:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::fullstack: restart service on setting changes [puppet] - 10https://gerrit.wikimedia.org/r/773233 (owner: 10Majavah)
[14:10:59] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1027.eqiad.wmnet with OS bullseye
[14:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye completed...
[14:11:55] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:11:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::fullstack: use bullseye image [puppet] - 10https://gerrit.wikimedia.org/r/773234 (owner: 10Majavah)
[14:13:27] <wikibugs>	 (03PS2) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941)
[14:14:33] <wikibugs>	 (03PS3) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941)
[14:14:39] <wikibugs>	 (03PS1) 10Jbond: PKI: add '}' back [puppet] - 10https://gerrit.wikimedia.org/r/773237
[14:14:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] PKI: add '}' back [puppet] - 10https://gerrit.wikimedia.org/r/773237 (owner: 10Jbond)
[14:15:42] <wikibugs>	 (03PS4) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941)
[14:16:37] <kostajh>	 urbanecm: ^ is it OK to +2 this, or do we need to do a sync as well?
[14:17:21] <taavi>	 it only touches a -labs.php file, so you need to +2 and pull to deploy1002 but you don't need to sync it
[14:17:35] <wikibugs>	 10SRE, 10ops-eqiad: asw2-b-eqiad:FPC5 <-> FPC7 link down - https://phabricator.wikimedia.org/T304488 (10Jclark-ctr) 05Open→03Resolved Found Dac cable in rack B7 not seated reseated cable and confirmed link with @ayounsi
[14:18:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1029.eqiad.wmnet with OS bullseye
[14:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:13] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wcqs1002.eqiad.wmnet
[14:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) Thinking about this further I think it works from the CRs because the peering is from the local public/private subnet to the loopbac...
[14:20:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[14:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:09] <kostajh>	 taavi: thanks. I assume it's OK for me to +2 it since I've gotten +1s from two others, and it's -labs only
[14:22:53] <taavi>	 yeah, sounds fine to me
[14:23:51] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] "Per Martin & Sergio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: 10Kosta Harlan)
[14:23:56] <bblack>	 !log reboot cp1085 (downtimed)
[14:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:32] <wikibugs>	 (03Merged) 10jenkins-bot: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: 10Kosta Harlan)
[14:25:39] <kostajh>	 taavi: how do I pull to deploy1002?
[14:26:34] <kostajh>	 scap sync-file?
[14:27:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[14:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:22] <taavi>	 cd to /srv/mediawiki-staging and then just git fetch && git rebase
[14:28:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[14:28:33] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773239
[14:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "I think we can deploy this tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773239 (owner: 10Lucas Werkmeister (WMDE))
[14:29:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:30:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:16] <urbanecm>	 kostajh: are you deploying something please?
[14:33:31] <urbanecm>	 (I'd like to, so that's why I'm asking)
[14:33:34] <icinga-wm>	 PROBLEM - Host wcqs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:41] <kostajh>	 urbanecm: I just +2'ed that beta labs config patch
[14:33:43] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1030.eqiad.wmnet with OS bullseye
[14:33:46] <kostajh>	 but didn't do anything else yet
[14:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:05] <taavi>	 s/beta labs/beta cluster/ please
[14:34:17] <urbanecm>	 :))
[14:34:18] <kostajh>	 urbanecm: I'll do the git fetch and rebase step in mediawiki-staging
[14:34:20] <kostajh>	 heh, sure
[14:34:25] <urbanecm>	 kostajh: okay, please ping me once done :)
[14:34:58] <kostajh>	 I don't see the patch in `git log` on mediawiki-staging. Does it take some time to show up?
[14:35:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add helm charts and a helmfile configuration for datahub (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[14:35:04] <icinga-wm>	 RECOVERY - Host wcqs2001 is UP: PING OK - Packet loss = 0%, RTA = 32.73 ms
[14:35:25] <urbanecm>	 kostajh: you need to do git fetch manually
[14:35:28] <urbanecm>	 there's no autopull
[14:35:45] <kostajh>	 urbanecm: I've done that
[14:35:55] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005)
[14:36:03] <taavi>	 did you rebase too?
[14:36:04] <urbanecm>	 kostajh: okay. Does git log -p HEAD..@{u} show your patch?
[14:36:08] <urbanecm>	 (and only your patch)
[14:36:11] <urbanecm>	 if so, do git rebase
[14:36:40] <kostajh>	 urbanecm: ah, ok. done
[14:36:42] <kostajh>	 thanks. 
[14:36:43] <kostajh>	 over to you
[14:36:46] <urbanecm>	 thanks
[14:37:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm)
[14:37:32] <mmandere>	 !log depool cp1080 for reimage - T290005
[14:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:37] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[14:38:12] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1085.eqiad.wmnet
[14:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:14] <wikibugs>	 (03Merged) 10jenkins-bot: addWiki: Create GrowthExperiment's tables for all new Wikipedia [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772900 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm)
[14:40:01] <icinga-wm>	 RECOVERY - Check to ensure the cfssl signer is working CA: debmonitor #page on pki1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1756 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[14:40:20] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp1080 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773197 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[14:41:47] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.3/extensions/WikimediaMaintenance/addWiki.php: 9a0aed0: addWiki: Create GrowthExperiment tables for all new Wikipedias (T304052) (duration: 01m 06s)
[14:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:53] <stashbot>	 T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052
[14:41:55] <urbanecm>	 done with deployment for now
[14:42:02] <urbanecm>	 (will be back in ~15 mins for the wiki creation window)
[14:44:39] <wikibugs>	 (03PS1) 10Jbond: P:pki::multirootca::monitoring: triple escape :/ [puppet] - 10https://gerrit.wikimedia.org/r/773243
[14:44:39] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS buster
[14:44:40] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1030.eqiad.wmnet with reason: host reimage
[14:44:42] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1030.eqiad.wmnet with reason: host reimage
[14:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster
[14:45:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1031.eqiad.wmnet with OS bullseye
[14:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34522/console" [puppet] - 10https://gerrit.wikimedia.org/r/773243 (owner: 10Jbond)
[14:46:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca::monitoring: triple escape :/ [puppet] - 10https://gerrit.wikimedia.org/r/773243 (owner: 10Jbond)
[14:46:32] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-b-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:46:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:47:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:06] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack)
[14:48:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[14:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:47] <wikibugs>	 (03PS1) 10BBlack: map Spain to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773244 (https://phabricator.wikimedia.org/T304089)
[14:50:49] <wikibugs>	 (03PS1) 10BBlack: map France to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773245 (https://phabricator.wikimedia.org/T304089)
[14:51:26] <icinga-wm>	 RECOVERY - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1770 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[14:54:04] <icinga-wm>	 RECOVERY - Check to ensure the cfssl signer is working CA: cloud_wmnet_ca #page on pki2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1770 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/PKI/CA_Operations
[14:54:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 85 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:59:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage
[14:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] <jouncebot>	 Urbanecm and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for New wiki creation . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1500).
[15:00:09] <urbanecm>	 o/
[15:00:12] <urbanecm>	 Amir1: let's start?
[15:00:21] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 61 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:00:24] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1030.eqiad.wmnet with OS bullseye
[15:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:35] <Amir1>	 sure!
[15:00:43] <urbanecm>	 okay, +2'ing the first one
[15:00:49] <wikibugs>	 (03PS3) 10Urbanecm: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797)
[15:00:55] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) (owner: 10Urbanecm)
[15:01:28] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage
[15:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:59] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797) (owner: 10Urbanecm)
[15:02:44] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage
[15:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:39] <urbanecm>	 pulling to mwmaint
[15:03:54] <wikibugs>	 (03PS1) 10Jbond: P:puppetboard: don't monitor testing sites [puppet] - 10https://gerrit.wikimedia.org/r/773248
[15:04:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetboard: don't monitor testing sites [puppet] - 10https://gerrit.wikimedia.org/r/773248 (owner: 10Jbond)
[15:04:30] <urbanecm>	 running addwiki
[15:05:02] <urbanecm>	 db was created at db1130, which is s5 primary
[15:05:06] <urbanecm>	 pulling to mwdebug
[15:05:37] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage
[15:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:13] <urbanecm>	 wiki works, syncing
[15:08:26] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating shnwikivoyage (T302797) (duration: 01m 05s)
[15:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:31] <stashbot>	 T302797: Create Wikivoyage Shan - https://phabricator.wikimedia.org/T302797
[15:08:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:09:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:39] <wikibugs>	 (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321)
[15:09:39] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists: Creating shnwikivoyage (T302797) (duration: 01m 05s)
[15:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:31] <wikibugs>	 (03PS3) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727)
[15:11:05] <wikibugs>	 (03PS4) 10Urbanecm: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727)
[15:12:02] <logmsgbot>	 !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating shnwikivoyage (T302797)
[15:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:10] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating shnwikivoyage (T302797) (duration: 01m 05s)
[15:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm)
[15:14:19] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating shnwikivoyage (T302797) (duration: 01m 05s)
[15:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:23] <stashbot>	 T302797: Create Wikivoyage Shan - https://phabricator.wikimedia.org/T302797
[15:14:31] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for guwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727) (owner: 10Urbanecm)
[15:14:33] <urbanecm>	 and the last sync...
[15:15:03] <taavi>	 zabe: ah I see you're doing the exact same thing I am :P
[15:15:17] <urbanecm>	 taavi: acquiring low IDs?
[15:15:20] <taavi>	 yes
[15:15:27] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating shnwikivoyage (T302797) (duration: 01m 05s)
[15:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:36] <wikibugs>	 (03PS1) 10Jbond: P:chartmuseum:  Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773249 (https://phabricator.wikimedia.org/T304321)
[15:15:42] <Amir1>	 taavi: I'm still mad about mailman
[15:15:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:15:49] <Amir1>	 the apache was not properly up
[15:15:52] <urbanecm>	 :(
[15:16:04] <zabe>	 taavi, it's the same game as always :p
[15:16:14] <Amir1>	 in your defense, that's quite an achievement 
[15:16:17] <wikibugs>	 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) For the Ingress part we will need to use two different names/discovery records for the services (as we can't distinguish by port). Maybe `datahub.disc...
[15:16:19] <urbanecm>	 no one has beaten Maintenance script so far :))
[15:16:31] <wikibugs>	 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) a:03JMeybohm
[15:16:42] <urbanecm>	 okay, let's see if my change to addWiki.php works
[15:17:27] <Amir1>	 urbanecm: the growth table? I'm not sure if it's deployed yet
[15:17:46] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp1078 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005)
[15:17:47] <urbanecm>	 i backported it earlier today
[15:17:47] <taavi>	 I would need to convince someone to +2 a addWiki.php patch if I wanted to beat maintenance script
[15:17:48] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005)
[15:17:50] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp2033 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005)
[15:17:52] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005)
[15:17:54] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp2029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005)
[15:17:56] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005)
[15:18:20] <urbanecm>	 and the tables are there too
[15:18:26] <urbanecm>	 so it worked :))
[15:18:43] <urbanecm>	 and the wiki's up too, so...syncing
[15:18:44] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::server:  Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773254 (https://phabricator.wikimedia.org/T304321)
[15:19:29] <zabe>	 It was possible some time ago when maintenance script was broken, e.g. shiwiki
[15:19:54] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating guwwiki (T303727) (duration: 01m 05s)
[15:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:01] <stashbot>	 T303727: Create Wikipedia Gungbe - https://phabricator.wikimedia.org/T303727
[15:20:10] <taavi>	 I think it's actually fairly recent that it's using User:Maintenance_script, previously those edits were attributed to 127.0.0.1
[15:20:15] <urbanecm>	 yup yup
[15:20:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:53] <wikibugs>	 (03PS1) 10JMeybohm: Allow multiple tlsHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/773255 (https://phabricator.wikimedia.org/T290966)
[15:20:57] <wikibugs>	 (03PS1) 10JMeybohm: Add correct tlsHostnames and extra SAN to datahub cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/773256 (https://phabricator.wikimedia.org/T303049)
[15:21:14] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists: Creating guwwiki (T303727) (duration: 01m 10s)
[15:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:24] <wikibugs>	 (03PS1) 10Jbond: P:docker_registry_ha::registry:  Add ssl expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/773257 (https://phabricator.wikimedia.org/T304321)
[15:21:27] <wikibugs>	 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) >>! In T303049#7800172, @JMeybohm wrote: > For the Ingress part we will need to use two different names/discovery records for the services (as we can't...
[15:21:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:21:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:22:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:53] <zabe>	 hmm, addwiki.php is throwing some PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: the local wiki, Actual: 'guwwiki'.
[15:23:03] <urbanecm>	 :(
[15:23:08] <urbanecm>	 zabe: can you check if it has a task?
[15:23:09] <logmsgbot>	 !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating guwwiki (T303727)
[15:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:40] <zabe>	 yes, I can't find one, let me create one
[15:23:52] <urbanecm>	 thanks zabe 
[15:23:58] <urbanecm>	 first time i see scap saying `15:23:16 Huh, lock file disappeared before deletion. This is probably fine-ish :)`
[15:24:07] <urbanecm>	 i guess that's because i do a lot of syncs now?
[15:24:22] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating guwwiki (T303727) (duration: 01m 06s)
[15:24:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:53] <dancy>	 urbanecm: Hmm... I'll check the code 
[15:25:08] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1031.eqiad.wmnet with OS bullseye
[15:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:29] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating guwwiki (T303727) (duration: 01m 05s)
[15:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:34] <stashbot>	 T303727: Create Wikipedia Gungbe - https://phabricator.wikimedia.org/T303727
[15:26:38] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating guwwiki (T303727) (duration: 01m 07s)
[15:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:49] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized langlist: Creating guwwiki (T303727) (duration: 01m 04s)
[15:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:55] <wikibugs>	 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7800187, @BTullis wrote: > It's not going to affect the public-facing (but authenticated) URL of https://datahub.wikimedia.org for the...
[15:27:56] <urbanecm>	 okay, per wiki syncs are done now
[15:28:03] <urbanecm>	 updating interwiki cache now
[15:28:10] <zabe>	 created T304528
[15:28:11] <stashbot>	 T304528: PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: the local wiki, Actual: 'guwwiki'. Pass expected $wikiId. [Called from MediaWiki\Revision\RevisionRecord::getPageId] - https://phabricator.wikimedia.org/T304528
[15:28:18] <urbanecm>	 if only it worked...
[15:28:45] <urbanecm>	 i can't run scap update-interwiki-cache https://www.irccloud.com/pastebin/SzrfJDJ1/
[15:28:50] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1080.eqiad.wmnet with OS buster
[15:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:58] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster com...
[15:30:47] <urbanecm>	 Amir1: taavi: zabe: any idea why that's happening?
[15:31:05] <urbanecm>	 i see that error thrown at https://gerrit.wikimedia.org/g/mediawiki/core/+/77e159c161a7b83ebe72d4c614674aaf64f7f0fc/includes/interwiki/ClassicInterwikiLookup.php#130, but...wgInterwikiCache should be an array
[15:31:42] <Amir1>	 interwiki cache is not urgent
[15:31:50] <Amir1>	 but yeah,   messed up
[15:31:58] <mmandere>	 !log pool cp1080 with HAProxy as TLS termination layer - T290005
[15:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:03] <urbanecm>	 it's not, I'm just wondering what happened with it :)
[15:32:03] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[15:32:09] <urbanecm>	 happy to phabricatorize & leave for later
[15:32:11] <Amir1>	 I think it might be due to configuration handling changes in core
[15:32:33] <Amir1>	 yeah, let's have a phabricator ticket for it
[15:32:35] <zabe>	 there was some refactoring that happened
[15:32:47] <zabe>	 I guess mwscript extensions/WikimediaMaintenance/dumpInterwiki.php --wiki=aawiki should work as alternative
[15:33:10] <urbanecm>	 hmm, that works
[15:33:26] <zabe>	 weird
[15:34:02] <urbanecm>	 and `/usr/local/bin/mwscript extensions/WikimediaMaintenance/dumpInterwiki.php`, which is what scap update-interwiki-cache runs, works too
[15:34:35] * urbanecm is confused
[15:35:32] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:35:36] <dancy>	 urbanecm: Do you have a full transcript that includes that warning about the lockfile being missing?
[15:35:47] <urbanecm>	 dancy: should be in my scrollback, gimme a sec
[15:36:40] <urbanecm>	 dancy: fyi also filed T304529 about scap (the interwiki issue).
[15:36:40] <stashbot>	 T304529: scap update-interwiki-cache throws MWException: Setting $wgInterwikiCache to a CDB path is no longer supported - https://phabricator.wikimedia.org/T304529
[15:37:15] <urbanecm>	 dancy: unfortunately, the lockfile part of the scrollback is gone now. but it was a regular sync, with regular messages, just this one appeared at the top
[15:37:21] <urbanecm>	 and it happened for a single deployment only
[15:37:29] <urbanecm>	 syncs before and after worked fine
[15:38:04] <dancy>	 Hmm.. no use of control-c ?
[15:38:15] <urbanecm>	 nope
[15:38:29] <urbanecm>	 just copy&pasting scap sync-file's to my bash session
[15:38:38] <dancy>	 alright. thanks
[15:38:51] <urbanecm>	 !log Created shnwikivoyage and guwwiki
[15:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:16] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:39:50] <urbanecm>	 !log foreachwikiindblist wikipedia extensions/WikimediaMaintenance/createExtensionTables.php growthexperiments # T304052
[15:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:57] <wikibugs>	 (03CR) 10STran: Allow autoconfirmed users to view basic IP information (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[15:39:58] <stashbot>	 T304052: Enable Growth features on Wikipedias upon creation - https://phabricator.wikimedia.org/T304052
[15:40:56] <wikibugs>	 (03CR) 10Urbanecm: Allow autoconfirmed users to view basic IP information (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[15:42:29] <urbanecm>	 I'll use the remainder of my window to test the rest of T304052 (now that the tables are at all Wikipedias)
[15:46:34] <wikibugs>	 (03CR) 10STran: [C: 03+1] Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[15:46:52] <urbanecm>	 Amir1: when accessing guw.wikipedia.org from my staff acc, i get a `2022-03-23 15:45:50 [0d2f325a-dfca-46f8-ae9f-036af9c33950] mw1320 guwwiki 1.39.0-wmf.3 exception ERROR: [0d2f325a-dfca-46f8-ae9f-036af9c33950] /   Wikimedia\Rdbms\DBQueryError: Error 1205: Lock wait timeout exceeded; try restarting transaction (db1130)` :(
[15:46:59] <wikibugs>	 (03PS1) 10Majavah: admin: add developer-portal namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/773267 (https://phabricator.wikimedia.org/T297140)
[15:47:04] <urbanecm>	 `Query: INSERT IGNORE INTO `user_properties` (up_user,up_property,up_value) VALUES (41,'VectorSkinVersion','1')`, from `MediaWiki\User\UserOptionsManager::saveOptionsInternal`
[15:47:17] <Amir1>	 :/
[15:47:32] <Amir1>	 let me check
[15:47:35] <wikibugs>	 (03PS1) 10Majavah: Add dummy tokens for developer-portal [labs/private] - 10https://gerrit.wikimedia.org/r/773268 (https://phabricator.wikimedia.org/T297140)
[15:47:45] <urbanecm>	 funnily enough, user_id=41 matches zero rows
[15:47:57] <taavi>	 oh I've seen that before, I think the last update on that task was 'it was fixed'
[15:48:20] <wikibugs>	 (03PS1) 10Majavah: Add developer-portal k8s accounts [puppet] - 10https://gerrit.wikimedia.org/r/773270 (https://phabricator.wikimedia.org/T297140)
[15:48:39] <urbanecm>	 taavi: you've seen that for newly born wikis, or in general?
[15:48:54] <taavi>	 in general when creating accounts
[15:48:59] <urbanecm>	 i see
[15:49:03] <taavi>	 lemme try to find that task
[15:49:18] <taavi>	 https://phabricator.wikimedia.org/T294995
[15:49:28] <Amir1>	 yeah, basically when two users are trying to be created at the same time
[15:50:18] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bullseye
[15:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:26] <zabe>	 so it's still not fixed :/
[15:50:37] <urbanecm>	 looks so :/
[15:50:43] <urbanecm>	 taavi: thanks for the link, left a comment there
[15:51:36] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:52:02] <zabe>	 although the stack is a different one
[15:52:11] <wikibugs>	 (03PS1) 10Majavah: kubeadm::helm: install helmfile [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532)
[15:52:20] * urbanecm done with T304052 testing
[15:55:53] <wikibugs>	 (03PS1) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272
[15:56:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (owner: 10Jbond)
[15:56:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm::helm: install helmfile [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah)
[15:57:30] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "Wow!" [puppet] - 10https://gerrit.wikimedia.org/r/773271 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah)
[15:58:32] <wikibugs>	 (03PS1) 10Majavah: kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274
[15:58:36] <wikibugs>	 (03PS1) 10Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532)
[15:59:45] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34523/console" [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: 10Majavah)
[16:00:25] <wikibugs>	 (03PS33) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[16:00:29] <wikibugs>	 (03PS1) 10Majavah: kubeadm::helm: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/773277
[16:00:31] <wikibugs>	 (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[16:00:38] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:01:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2033 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:02:00] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 72 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:02:09] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:03:13] <wikibugs>	 (03PS2) 10Majavah: kubeadm::helm: use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/773274
[16:03:15] <wikibugs>	 (03PS2) 10Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - 10https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532)
[16:04:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage
[16:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:50] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] kubeadm::helm: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/773277 (owner: 10Majavah)
[16:05:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:05:46] <wikibugs>	 (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1011 [puppet] - 10https://gerrit.wikimedia.org/r/773278 (https://phabricator.wikimedia.org/T300744)
[16:05:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10KFrancis) @jbond @TheDJ The agreement has been sent out for signatures.
[16:06:04] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:06:19] <zabe>	 Amir1, btw is there a specific reason why the 'post-creation' tasks are created with a custom edit policy?
[16:07:21] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage
[16:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:25] <Amir1>	 zabe: i doubt it. Can you check the code? On phone atm 
[16:07:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1011 [puppet] - 10https://gerrit.wikimedia.org/r/773278 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[16:08:00] <zabe>	 i can take a look
[16:08:21] <wikibugs>	 (03PS2) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272
[16:09:38] <wikibugs>	 (03Abandoned) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/767530 (owner: 10Ssingh)
[16:10:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1011.eqiad.wmnet with OS bullseye
[16:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:18] <wikibugs>	 (CR) Jbond: "once this is in place we can update the command definitions to use this new check.  this will prevent us from having to create dedicate mo" [puppet] - https://gerrit.wikimedia.org/r/773272 (owner: Jbond)
[16:12:40] <wikibugs>	 (CR) David Caro: [C: +2] kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: Majavah)
[16:14:42] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 675 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:18:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] kubeadm::helm: use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[16:18:28] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) (owner: Majavah)
[16:18:39] <wikibugs>	 (PS3) Jbond: O:nrpe: add check_http_wmf script [puppet] - https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321)
[16:18:41] <wikibugs>	 (PS1) Jbond: icinga: move client_auth_puppet_post to use wmf_check_http [puppet] - https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321)
[16:19:48] <wikibugs>	 (CR) David Caro: "Got a question" [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[16:19:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1011.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:20:58] <wikibugs>	 (PS1) David Caro: systemd:environment: fix typo in docs [puppet] - https://gerrit.wikimedia.org/r/773280
[16:21:19] <wikibugs>	 (PS10) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[16:25:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage
[16:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: host reimage
[16:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:58] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bullseye
[16:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:03] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:07] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:34:54] <wikibugs>	 (CR) David Caro: [C: +2] systemd:environment: fix typo in docs [puppet] - https://gerrit.wikimedia.org/r/773280 (owner: David Caro)
[16:39:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1011.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:40:46] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1011.eqiad.wmnet with OS bullseye
[16:40:47] <dancy>	 jouncebot now
[16:40:47] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 19 minute(s)
[16:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:00] <dancy>	 I'm going to test /usr/local/bin/mwscript extensions/WikimediaMaintenance/dumpInterwiki.php on deploy1002
[16:42:40] <wikibugs>	 (PS1) Jbond: external_cloud_endors: ensure we sintall the python3-conftool dependency [puppet] - https://gerrit.wikimedia.org/r/773283
[16:43:12] <wikibugs>	 (CR) Jbond: [C: +2] external_cloud_endors: ensure we sintall the python3-conftool dependency [puppet] - https://gerrit.wikimedia.org/r/773283 (owner: Jbond)
[16:44:55] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:39] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:49] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:46:51] <icinga-wm>	 RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] envoy: Remove v2 config API support [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: RLazarus)
[16:48:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[16:48:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:29] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34524/console" [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: RLazarus)
[16:51:00] <wikibugs>	 SRE: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (Snwachukwu)
[16:51:10] <wikibugs>	 (PS1) Cwhite: profile: Rsyslog omkafka configs use new ca bundle [puppet] - https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905)
[16:53:31] <wikibugs>	 (CR) Majavah: kubeadm::helm: use systemd::environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[16:54:17] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:56:34] <wikibugs>	 (CR) David Caro: kubeadm::helm: use systemd::environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[16:58:26] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS bullseye
[16:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:43] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[16:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:04] <wikibugs>	 (CR) Majavah: kubeadm::helm: use systemd::environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[16:59:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bullseye
[16:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:43] <taavi>	 jouncebot: refresh
[17:01:44] <jouncebot>	 I refreshed my knowledge about deployments.
[17:01:52] <taavi>	 jouncebot: nowandnext
[17:01:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[17:01:52] <jouncebot>	 In 0 hour(s) and 58 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1800)
[17:02:50] * taavi needs to learn that ^W in a browser based client closes the tab instead of deleting that word
[17:03:14] <arturo>	 taavi: happens to me all the time on irccloud -_-
[17:04:53] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:05:27] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:07:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1034.eqiad.wmnet with OS bullseye
[17:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:51] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:10:44] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage
[17:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:55] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage
[17:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[17:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:45] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:14:57] <wikibugs>	 SRE, serviceops: Service puppet certificate due to expire - https://phabricator.wikimedia.org/T304543 (jbond) p:Triage→High
[17:17:40] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[17:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage
[17:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:42] <wikibugs>	 (CR) RLazarus: [V: +1 C: +2] envoy: Remove v2 config API support [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: RLazarus)
[17:25:05] <wikibugs>	 SRE, serviceops: Service puppet certificate due to expire - https://phabricator.wikimedia.org/T304543 (jbond)
[17:25:35] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (jbond)
[17:25:57] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage
[17:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:02] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[17:26:08] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, envoy, serviceops, Patch-For-Review: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (RLazarus) Open→Resolved
[17:26:38] <wikibugs>	 (CR) David Caro: [C: +2] kubeadm::helm: use systemd::environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773274 (owner: Majavah)
[17:27:07] <wikibugs>	 (CR) David Caro: [C: +2] P:environment: enable export_systemd_env in cloud [puppet] - https://gerrit.wikimedia.org/r/771576 (owner: Jbond)
[17:27:17] <wikibugs>	 (PS4) David Caro: P:environment: enable export_systemd_env in cloud [puppet] - https://gerrit.wikimedia.org/r/771576 (owner: Jbond)
[17:27:22] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[17:27:49] <wikibugs>	 (CR) David Caro: [C: +2] "Merge whenever you are ready" [puppet] - https://gerrit.wikimedia.org/r/771576 (owner: Jbond)
[17:28:49] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (jbond) >>! In T304237#7797420, @Volans wrote: >>>! In T304237#7797398, @JMeybohm wrote: >>>>! In T304237#7795994, @Volans wrote: >>>...
[17:31:09] <wikibugs>	 (PS4) Jbond: O:nrpe: add check_http_wmf script [puppet] - https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321)
[17:32:17] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1028.eqiad.wmnet with OS bullseye
[17:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:19] <wikibugs>	 (PS6) Giuseppe Lavagetto: Introduce requestctl [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471)
[17:38:35] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bullseye
[17:38:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:02] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:46] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:47:48] <brennen>	 jouncebot now
[17:47:48] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 12 minute(s)
[17:48:15] <brennen>	 !log trainsperiment (T300203): starting prep for 1.39.0-wmf.4
[17:48:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:56] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[17:50:45] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1034.eqiad.wmnet with OS bullseye
[17:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:24] <wikibugs>	 (PS1) Brennen Bearnes: testwikis wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773293
[17:51:26] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] testwikis wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773293 (owner: Brennen Bearnes)
[17:52:28] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773293 (owner: Brennen Bearnes)
[17:52:32] <logmsgbot>	 !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.4  refs T300203
[17:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:31] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) Open→In progress
[17:55:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:55:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:52] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34525/console" [puppet] - https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: Cwhite)
[17:59:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for 🚂🧪Trainsperiment Week Deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T1800).
[18:00:48] <dancy>	 In progress!
[18:01:21] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34526/console" [puppet] - https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: Cwhite)
[18:02:04] <wikibugs>	 (CR) Elukey: [V: +1 C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: Cwhite)
[18:02:56] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:04:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:18] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:05:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:05:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:39] <wikibugs>	 (PS3) Majavah: kubeadm::helm: use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/773274
[18:05:41] <wikibugs>	 (PS3) Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532)
[18:05:54] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:06:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 59, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:06:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:07:05] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "lgtm!" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773309 (owner: RhinosF1)
[18:10:21] <wikibugs>	 SRE, Data-Engineering: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (Ottomata)
[18:14:00] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:15:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 60, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:15:14] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:17:06] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:25:03] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack) PT was pretty smooth, ES likely to be later today, closer to when their daily traffic cycle begins to trend downwards.
[18:25:11] <wikibugs>	 (PS1) Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T169144)
[18:25:30] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:28:12] <wikibugs>	 (PS3) Bking: elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: Gehel)
[18:28:59] <wikibugs>	 (PS1) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553)
[18:29:01] <wikibugs>	 (PS2) Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108)
[18:29:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: Cathal Mooney)
[18:31:10] <wikibugs>	 (CR) Ebernhardson: [C: +1] elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: Gehel)
[18:31:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:10] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:32:59] <wikibugs>	 (CR) Jcrespo: "Initial patch to start a conversation." [puppet] - https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: Jcrespo)
[18:36:33] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1035.eqiad.wmnet with OS bullseye
[18:36:35] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1036.eqiad.wmnet with OS bullseye
[18:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:37:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:25] <wikibugs>	 (PS2) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553)
[18:38:44] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:42:13] <logmsgbot>	 !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.4  refs T300203 (duration: 49m 41s)
[18:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:18] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[18:43:09] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) p:Medium→Low Unclear whether or not we want this logic to live in Wmfdata-Python; i...
[18:43:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:43:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:37] <logmsgbot>	 !log brennen@deploy1002 Pruned MediaWiki: 1.38.0-wmf.26 (duration: 02m 05s)
[18:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr)
[18:47:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr)
[18:47:30] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:47:35] <brennen>	 !log trainsperiment (T300203): 1.39.0-wmf.4 on testwikis; proceeding to groups 0-2 with 15 minute intervals for watching logs
[18:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:39] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[18:48:07] <wikibugs>	 (PS3) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553)
[18:48:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr) @cmooney  i have connected spine switches to scs and updated netbox
[18:48:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:48:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:04] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) Hmm, the 1.21.1 build didn't work out of the box. Running `build-envoy-deb buster future` got me this:  ` [...] ./ci/run_envoy_docker.sh ./ci/do_ci.sh b...
[18:50:25] <RhinosF1>	 brennen: what's going on with the php-fpm alert above
[18:50:41] <RhinosF1>	 that's been noisy the last day
[18:50:58] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage
[18:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage
[18:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:52:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:07] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4  refs T300203
[18:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:12] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[18:54:32] <wikibugs>	 (CR) Bking: [C: +2] elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: Gehel)
[18:55:01] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage
[18:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:52] <wikibugs>	 (PS1) Arlolra: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555)
[18:56:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:56:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:31] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage
[18:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:59] <wikibugs>	 (PS1) Brennen Bearnes: group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773304
[18:57:00] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773304 (owner: Brennen Bearnes)
[18:57:30] <brennen>	 (bit of weirdness trying out new `scap deploy-promote` above; this sync should effectively be a no-op.)
[18:57:42] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773304 (owner: 10Brennen Bearnes)
[18:57:59] <brennen>	 RhinosF1: good question
[18:58:53] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[18:59:21] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4  refs T300203
[18:59:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:26] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[19:00:12] <RhinosF1>	 brennen: unfortunately I don't have a good answer to go with it
[19:01:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:02:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:11] <wikibugs>	 (03PS1) 10Brennen Bearnes: group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326
[19:04:13] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326 (owner: 10Brennen Bearnes)
[19:04:58] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773326 (owner: 10Brennen Bearnes)
[19:06:24] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb: Add status page functionality to / [puppet] - 10https://gerrit.wikimedia.org/r/773327
[19:07:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34527/console" [puppet] - 10https://gerrit.wikimedia.org/r/773327 (owner: 10Jbond)
[19:08:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:14] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.4  refs T300203
[19:08:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb: Add status page functionality to / [puppet] - 10https://gerrit.wikimedia.org/r/773327 (owner: 10Jbond)
[19:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:19] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[19:08:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:08:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:07] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.4  refs T300203 (duration: 00m 52s)
[19:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:19] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:09:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[19:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:55] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[19:20:35] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1035.eqiad.wmnet with OS bullseye
[19:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) To clarify the 'port' isn't an option on QFX even for UDP, although it allows you to define a term with that.  So I've changed...
[19:20:51] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:20:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1036.eqiad.wmnet with OS bullseye
[19:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:57] <wikibugs>	 (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330
[19:20:59] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330 (owner: 10Brennen Bearnes)
[19:21:10] <wikibugs>	 (03PS1) 10Jdlrobson: Enable split A/B testing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584)
[19:22:05] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773330 (owner: 10Brennen Bearnes)
[19:23:16] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[19:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:24] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[19:23:36] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.4  refs T300203
[19:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:45] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[19:24:18] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:25:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:25:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:04] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:37:24] <brennen>	 !log trainsperiment (T300203): 1.39.0-wmf.4 on all wikis; logs seem clean - end of train deployment activities for the week, unless bugs emerge
[19:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:29] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[19:38:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:38:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:38:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:20] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:41:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @jclark-ctr super thanks for that!  I'll open a task and start planning how we take care of the move.
[19:44:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[19:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:21] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[19:58:38] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03MoritzMuehlenhoff >>! In T297913#7788208, @MoritzMuehlenhoff wrote: > dumpsdata1007 is now running 5.16.11, can you please retest? >  > I'm not familiar with perccli myself, if there...
[20:00:05] <jouncebot>	 RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220323T2000).
[20:00:05] <jouncebot>	 bd808 and Tran: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:33] <bd808>	 o/
[20:01:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr I'm not getting any output on port 20 or 29 of the scs-f8.  Are the two Junipers powered on?    If not can you double c...
[20:01:57] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[20:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:02] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[20:03:15] <bd808>	 I suppose I could technically do the deployment, but I haven't done scap things in quite some time so I would be more than happy to have RoanKattouw or urbanecm drive if they have time.
[20:04:15] <RoanKattouw>	 Yeah I can drive
[20:05:17] <wikibugs>	 (03PS3) 10Catrope: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:05:21] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:05:59] <bd808>	 thanks much RoanKattouw 
[20:06:06] <wikibugs>	 (03Merged) 10jenkins-bot: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:06:32] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] Enable split A/B testing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: 10Jdlrobson)
[20:07:36] <RoanKattouw>	 bd808: first patch is ready for testing on mwdebug1002
[20:07:58] <RoanKattouw>	 When you give me the go-ahead, I'll deploy it and queue up the next one
[20:08:33] <bd808>	 RoanKattouw: I verified that enwiki and mw.o still load. That's about all that I can test via mwdebug for wikitech things.
[20:08:59] <RoanKattouw>	 Ok
[20:09:09] <bd808>	 I don't have any fear of us crashing wikitech with these changes. Just of borking config in general
[20:09:27] <wikibugs>	 (03PS4) 10Catrope: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:09:36] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:09:50] <RoanKattouw>	 Makes sense
[20:09:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:06] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771443|wikitech: Remove DynamicSidebar (T304006)]] (duration: 00m 52s)
[20:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:10] <stashbot>	 T304006: Undeploy DynamicSidebar extension from Wikimedia wikis (only Wikitech) - https://phabricator.wikimedia.org/T304006
[20:10:22] <wikibugs>	 (03Merged) 10jenkins-bot: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:11:29] <RoanKattouw>	 bd808: Alright, next one up for testing, I assume it's the same thing of only being able to test that production wikis are still up
[20:12:20] <bd808>	 RoanKattouw: yes, and the smoke tests look good to me. enwiki and mw.o again
[20:13:36] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771444|DynamicSidebar: remove from CommonSettings (T304006)]] (duration: 00m 50s)
[20:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:14:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:26] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bullseye
[20:14:27] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bullseye
[20:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:31] <wikibugs>	 (03PS3) 10Catrope: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis)
[20:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:44] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis)
[20:15:53] <wikibugs>	 (03Merged) 10jenkins-bot: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 (owner: 10BryanDavis)
[20:17:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[20:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:18] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[20:22:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:23:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage
[20:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage
[20:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:59] <RoanKattouw>	 bd808: ok next one is up for testing
[20:31:11] <RoanKattouw>	 Sorry for the delay, I had to deal with a rebase conflict
[20:32:00] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bullseye
[20:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:06] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage
[20:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:14] <bd808>	 RoanKattouw: smoke tests passed. ship it :)
[20:33:51] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage
[20:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:16] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771447|DynamicSidebar: remove from InitialiseSettings]] (duration: 00m 51s)
[20:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:55] <Tran>	 (I'm here for my set of patches but need to restart real quick sorry!)
[20:35:07] <wikibugs>	 (03PS1) 10SBassett: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773340 (https://phabricator.wikimedia.org/T304111)
[20:35:13] <wikibugs>	 (03PS3) 10Catrope: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 (owner: 10BryanDavis)
[20:35:15] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 (owner: 10BryanDavis)
[20:35:29] <wikibugs>	 (03PS4) 10Catrope: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:35:31] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:37:13] <wikibugs>	 (03Merged) 10jenkins-bot: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis)
[20:38:22] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:40:34] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79. Check system logs on 10.64.20.79 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:40:35] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:771448|DynamicSidebar: remove unused extension (T304006)]] (duration: 00m 49s)
[20:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:40] <stashbot>	 T304006: Undeploy DynamicSidebar extension from Wikimedia wikis (only Wikitech) - https://phabricator.wikimedia.org/T304006
[20:41:00] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:41:34] <cjming>	 hi - I missed adding a config patch to this deployment window -- I'm happy to do it after the scheduled deployments are done if it's ok. It's config for beta cluster - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/773331
[20:43:26] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP
[20:44:00] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.79: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:44:16] <wikibugs>	 (03CR) 10Reedy: [C: 04-1] Enable split A/B testing on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: 10Jdlrobson)
[20:44:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:36] <RoanKattouw>	 cjming: go for it
[20:45:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:45:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:59] <Tran>	 <RoanKattouw> could we still do the ip info deploys too?
[20:46:03] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[20:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:11] <RoanKattouw>	 bd808: ok I think we're done, the docs say to also remove the repo from the make-wmf-branch script, but that script seems to have moved
[20:46:27] <RoanKattouw>	 So not sure what to do there, I'll ask in the releng channel
[20:46:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:48] <RoanKattouw>	 Tran: Yes I'll do yours next
[20:46:51] <Tran>	 thank you!
[20:46:53] <Reedy>	 RoanKattouw: mediawiki/tools/release
[20:47:01] <RoanKattouw>	 Sorry for the delay, I was trying to find my way through outdated docs
[20:47:18] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:47:25] <bd808>	 RoanKattouw: ack. I can take care of the make-wmf-branch bits too. Thanks for the deploy work!
[20:47:36] <RoanKattouw>	 Reedy: Sure but make-wmf-branch doesn't exist there anymore
[20:47:42] <RoanKattouw>	 And I don't see a list of extensions in that repo
[20:48:05] <Reedy>	 https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-release/settings.yaml
[20:48:32] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:48:53] <wikibugs>	 (03PS2) 10Catrope: Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[20:48:55] <wikibugs>	 (03PS2) 10Clare Ming: Enable split A/B testing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: 10Jdlrobson)
[20:48:59] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[20:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:09] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[20:49:52] <cjming>	 thanks RoanKattouw - I'll wait til you're done - no rush
[20:49:53] <wikibugs>	 (03Merged) 10jenkins-bot: Allow autoconfirmed users to view basic IP information [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772408 (https://phabricator.wikimedia.org/T303858) (owner: 10Tchanders)
[20:49:53] <RoanKattouw>	 Reedy: Thanks, I'll update the docs
[20:51:26] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:51:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:51:44] <bd808>	 Reedy: should I remove it from make-tarball-release too? I'm not sure what the inclusion criteria are there.
[20:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:03] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bullseye
[20:52:04] <Reedy>	 bd808: I don't use that script
[20:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:22] <Reedy>	 I suspect it's rotten
[20:52:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:52:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:52:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:47] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1037 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:52:48] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt1037 is OK: OK: synced at Wed 2022-03-23 20:52:46 UTC. https://wikitech.wikimedia.org/wiki/NTP
[20:53:06] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bullseye
[20:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:16] <RoanKattouw>	 Tran: Your change is ready for testing on mwdebug1002, please test
[20:53:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:23] <Tran>	 RoanKattouw I think I may have messed up the order of operations. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/767216 might have to go in as well so that IPInfo is actually enabled on testwiki.
[20:54:45] <RoanKattouw>	 Oh I see
[20:54:59] <Tran>	 Sorry 🙇‍♂️
[20:55:05] <wikibugs>	 (03PS6) 10Catrope: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders)
[20:55:09] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders)
[20:55:55] <wikibugs>	 (03Merged) 10jenkins-bot: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders)
[20:56:36] <RoanKattouw>	 Tran: OK try now
[20:58:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:58:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:09] <Tran>	 I can confirm that the extension is installed and the groups have the rights I expect
[21:00:33] <Tran>	 So good? I think? Something else elsewhere is not what I expect but these patches have done what they should
[21:01:06] <RoanKattouw>	 What is not how you expect and what would it take to fix it?
[21:01:42] <Tran>	 Hm I thought we enabled IP Info on BetaFeatures earlier but I can't find it in my Special:Preferences
[21:01:55] <Tran>	 Ideally I would have been able to e2e test this as well by enabling it and confirming I could use the feature
[21:02:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:02:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:15] <RoanKattouw>	 Oh you might have to add it to the list of BetaFeatures
[21:04:47] <RoanKattouw>	 See the wgBetaFeaturesWhitelist (sic) setting
[21:05:31] <RoanKattouw>	 ( Tran )
[21:05:59] <Tran>	 oh nooooo I remember now. I think we were still doing that. Okay yes the patches do what I expect and unfortunately, iirc now, we have not yet finished adding IPInfo as a beta feature
[21:06:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:34] <brennen>	 noting that i might be rolling the train back for T304564
[21:06:34] <stashbot>	 T304564: MWException: `[title]` is not a valid file title. - https://phabricator.wikimedia.org/T304564
[21:06:43] <brennen>	 (after deploy window is clear)
[21:08:58] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bullseye
[21:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:07] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:10:43] <Tran>	 Actually, RoanKattouw I think we have enabled it? `wmgUseIPInfo` has the `'testwiki' => true, // T260598` key
[21:10:45] <stashbot>	 T260598: Deploy IP Info extension to test.wikipedia.org - https://phabricator.wikimedia.org/T260598
[21:11:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:34] <RoanKattouw>	 Tran: Right, so IPInfo will be enabled on testwiki once I deploy this, but the BetaFeature will not be
[21:11:42] <RoanKattouw>	 Is that right? Should I pull the trigger and deploy?
[21:12:50] <Tran>	 I don't think it hurts to deploy? It does as expected and it doesn't break anything. It's just that w/o it being a BetaFeature, it won't be visible to users.
[21:13:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:13:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:02] <RoanKattouw>	 OK, then I'll deploy now
[21:14:10] <Tran>	 thank you!
[21:15:30] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:772408|Allow autoconfirmed users to view basic IP information (T303858)]] and [[gerrit:767216|Enable IPInfo on testwiki (T260598)]] (duration: 00m 50s)
[21:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:37] <stashbot>	 T303858: Make IP Info available to all users in the 'autoconfirmed' group on testwiki - https://phabricator.wikimedia.org/T303858
[21:15:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:36] <RoanKattouw>	 Alright we're all done
[21:18:03] <RoanKattouw>	 cjming: Feel free to +2 your labs-only change now, and after that brennen can take over and do the train rollback
[21:18:13] <Tran>	 thanks again!
[21:18:13] <cjming>	 will do - thanks Roan!
[21:18:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bullseye
[21:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:46] <wikibugs>	 (CR) Clare Ming: [C: +2] Enable split A/B testing on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson)
[21:19:12] <brennen>	 i may hold rollback unless it recurs (last was 20:55 UTC), but thanks for ping.
[21:19:47] <wikibugs>	 (Merged) jenkins-bot: Enable split A/B testing on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson)
[21:24:00] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:773331|Enable split A/B testing on beta cluster (T301584)]] (duration: 00m 50s)
[21:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:06] <stashbot>	 T301584: Add rich snippet instrument to WikimediaEvents - https://phabricator.wikimedia.org/T301584
[21:24:40] <cjming>	 alrighty I'm done too
[21:25:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:25:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:26:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[21:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:17] <wikibugs>	 (PS1) STran: Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802)
[21:35:20] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[21:35:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic ES 6.8 upgrade - bking@cumin1001 - T301956
[21:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:12] <stashbot>	 T301956: Upgrade cloudelastic to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301956
[21:48:30] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:55:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bullseye
[21:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bullseye
[22:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bullseye
[22:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:55] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:22] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go. Thank you Zabe!" [mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[22:18:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[22:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[22:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:48] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[22:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:13] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[22:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:26:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:27:53] <wikibugs>	 (PS1) Brennen Bearnes: Revert "Handle broken media and thumb error in the same case for gallery" [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773314
[22:28:55] <wikibugs>	 (CR) Jdlrobson: Enable split A/B testing on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773331 (https://phabricator.wikimedia.org/T301584) (owner: Jdlrobson)
[22:33:34] <wikibugs>	 (Abandoned) Brennen Bearnes: Revert "Handle broken media and thumb error in the same case for gallery" [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773314 (owner: Brennen Bearnes)
[22:36:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye
[22:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:04] <wikibugs>	 (PS1) Brennen Bearnes: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564)
[22:47:05] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bullseye
[22:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:07] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bullseye
[22:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[22:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:30] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[22:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:58] <wikibugs>	 (PS2) Krinkle: parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes)
[22:56:02] <wikibugs>	 (CR) Krinkle: [C: +1] parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes)
[23:16:02] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bullseye
[23:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:18] <cwhite>	 !log remove openjdk-8-jre from codfw logstash nodes T301770
[23:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:23] <stashbot>	 T301770: Remove obsolete Java 8 packages from logstash cluster - https://phabricator.wikimedia.org/T301770
[23:34:03] <brennen>	 !log trainsperiment (T300203): reverting to 1.39.0-wmf.3 on all wikis for T304564; will move forward again after a fix.
[23:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:09] <stashbot>	 T304564: MWException: `[title]` is not a valid file title. - https://phabricator.wikimedia.org/T304564
[23:34:09] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[23:35:07] <wikibugs>	 (PS1) Brennen Bearnes: all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363
[23:35:09] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363 (owner: Brennen Bearnes)
[23:36:26] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773363 (owner: Brennen Bearnes)
[23:38:02] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.3  refs T300203
[23:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:11] <wikibugs>	 (PS1) RLazarus: envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230)
[23:40:36] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[23:40:46] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Stalled→In progress p:Low→Medium
[23:43:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:45:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:38] <wikibugs>	 (CR) Brennen Bearnes: [C: -2] parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes)
[23:46:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bullseye
[23:48:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bullseye
[23:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:28] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bullseye
[23:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:59:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log