[00:18:30] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:51] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068
[00:38:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068 (owner: 10TrainBranchBot)
[01:01:32] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068 (owner: 10TrainBranchBot)
[01:08:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:24:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:29:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:38:03] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:57:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354015)', diff saved to https://phabricator.wikimedia.org/P58325 and previous config saved to /var/cache/conftool/dbconfig/20240304-025750-marostegui.json
[02:57:54] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[02:58:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:59] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773)
[03:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:12:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58326 and previous config saved to /var/cache/conftool/dbconfig/20240304-031256-marostegui.json
[03:28:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58327 and previous config saved to /var/cache/conftool/dbconfig/20240304-032803-marostegui.json
[03:43:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354015)', diff saved to https://phabricator.wikimedia.org/P58328 and previous config saved to /var/cache/conftool/dbconfig/20240304-034309-marostegui.json
[03:43:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[03:43:16] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[03:43:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[03:43:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58329 and previous config saved to /var/cache/conftool/dbconfig/20240304-034333-marostegui.json
[04:18:30] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:08:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:33] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:39] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:31:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:01:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594767 (10Marostegui)
[06:01:35] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594768 (10Marostegui) p:05Triage→03Medium
[06:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:05:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594770 (10Marostegui)
[06:11:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594787 (10Marostegui) While we verify the ssh account - @SBisson can you approve this request?  For  analytics-privatedata-users group, @odimitrijevic is this approved?
[06:12:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2118.codfw.wmnet
[06:13:05] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2118 [puppet] - 10https://gerrit.wikimedia.org/r/1008126 (https://phabricator.wikimedia.org/T358740)
[06:17:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:19:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2118 [puppet] - 10https://gerrit.wikimedia.org/r/1008126 (https://phabricator.wikimedia.org/T358740) (owner: 10Marostegui)
[06:19:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2118.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:21:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2118.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:21:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:21:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2118.codfw.wmnet
[06:21:42] <wikibugs>	 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9594825 (10Marostegui) a:05Marostegui→03None
[06:22:02] <wikibugs>	 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9594829 (10Marostegui) This is ready for #dc-ops
[06:27:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P58330 and previous config saved to /var/cache/conftool/dbconfig/20240304-062703-root.json
[06:27:39] <wikibugs>	 (03PS1) 10Marostegui: db1186: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1008129
[06:29:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1186: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1008129 (owner: 10Marostegui)
[06:35:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58331 and previous config saved to /var/cache/conftool/dbconfig/20240304-063516-root.json
[06:50:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58332 and previous config saved to /var/cache/conftool/dbconfig/20240304-065021-root.json
[06:53:03] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:11] <kart_>	 marostegui: OK to deploy cxserver now?
[06:55:55] <kart_>	 OK. Last log was an hour back, starting it..
[06:56:12] <marostegui>	 kart_: yeah go for it
[06:56:15] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry)
[06:56:21] <marostegui>	 kart_: that's just some automated schema change, there will be more coming :)
[06:56:39] <kart_>	 Thanks!
[06:59:36] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry)
[07:01:04] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:01:30] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:05:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58333 and previous config saved to /var/cache/conftool/dbconfig/20240304-070526-root.json
[07:05:38] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:06:16] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:07:53] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:08:30] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:08:42] <kart_>	 !log Updated cxserver to 2024-03-04-023843-production (T350773)
[07:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:53] <stashbot>	 T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773
[07:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:18:03] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58334 and previous config saved to /var/cache/conftool/dbconfig/20240304-072031-root.json
[07:32:16] <moritzm>	 !log installing tar security updates
[07:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58335 and previous config saved to /var/cache/conftool/dbconfig/20240304-073536-root.json
[07:50:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58336 and previous config saved to /var/cache/conftool/dbconfig/20240304-075041-root.json
[07:59:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T0800).
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007743 (owner: 10Volans)
[08:05:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58337 and previous config saved to /var/cache/conftool/dbconfig/20240304-080546-root.json
[08:34:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1007739 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus)
[08:35:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007740 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus)
[08:52:27] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[08:58:51] <wikibugs>	 (03CR) 10Slyngshede: PKI: Switch alerts to use the x509 metric. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[08:58:56] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] PKI: Switch alerts to use the x509 metric. [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[09:00:03] <wikibugs>	 (03Merged) 10jenkins-bot: PKI: Switch alerts to use the x509 metric. [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[09:01:50] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614
[09:08:26] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi)
[09:13:53] <wikibugs>	 (03PS2) 10Clément Goubert: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912
[09:14:44] <wikibugs>	 (03CR) 10Clément Goubert: "Thanks, updated the commit message to remove reference to the kubemasters." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert)
[09:22:11] <wikibugs>	 (03Abandoned) 10Mainframe98: GerritBot: Escape change number [puppet] - 10https://gerrit.wikimedia.org/r/1008001 (https://phabricator.wikimedia.org/T358940) (owner: 10Mainframe98)
[09:24:01] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert)
[09:26:50] <wikibugs>	 (03Merged) 10jenkins-bot: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert)
[09:27:27] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:27:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:27:38] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:28:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:28:56] <wikibugs>	 (03CR) 10DCausse: "Opensearch claims that the 2.0 client supports opensearch 1.0.0 (which should be equivalent to elastic 7.10.2) as long as we don't use fea" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro)
[09:30:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:30:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:32:43] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:openstack: rabbitmq: restrict clustering ports [puppet] - 10https://gerrit.wikimedia.org/r/1007864 (owner: 10Majavah)
[09:36:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:36:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: fix spark alerts deployment [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:38:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse)
[09:38:50] <wikibugs>	 (03Merged) 10jenkins-bot: data-engineering: fix spark alerts deployment [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:38:55] <wikibugs>	 (03Merged) 10jenkins-bot: data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi)
[09:43:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:25] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58338 and previous config saved to /var/cache/conftool/dbconfig/20240304-102842-marostegui.json
[10:28:47] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[10:31:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[10:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[10:43:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58339 and previous config saved to /var/cache/conftool/dbconfig/20240304-104348-marostegui.json
[10:48:29] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on etherpad1003.eqiad.wmnet with reason: Shutdown and decommission old host
[10:48:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on etherpad1003.eqiad.wmnet with reason: Shutdown and decommission old host
[10:48:55] <wikibugs>	 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595343 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=eff489f2-c167-46cc-8ac4-c471b433a777) set by jelto@cum...
[10:53:23] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bookworm
[10:53:46] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bookworm
[10:58:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58340 and previous config saved to /var/cache/conftool/dbconfig/20240304-105855-marostegui.json
[10:59:15] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1100)
[11:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[11:04:27] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:05:57] <wikibugs>	 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9595411 (10BTullis) >>! In T358763#9591267, @Jclark-ctr wrote: > @BTullis  I will be available monday 10am (est) if that works for you  Yes please, that's great. I'll notify the users and make sur...
[11:06:24] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková)
[11:08:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:08:13] <claime>	 !log Depooling mw2314.codfw.wmnet,mw2315.codfw.wmnet,mw2316.codfw.wmnet,mw2320.codfw.wmnet,mw2321.codfw.wmnet,mw2322.codfw.wmnet for move to k8s - T351074
[11:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:25] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:08:28] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008070
[11:11:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert)
[11:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[11:12:01] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage
[11:12:12] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage
[11:14:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58341 and previous config saved to /var/cache/conftool/dbconfig/20240304-111401-marostegui.json
[11:14:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[11:14:06] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[11:14:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[11:14:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58342 and previous config saved to /var/cache/conftool/dbconfig/20240304-111424-marostegui.json
[11:14:35] <wikibugs>	 (03PS1) 10Btullis: Failover the analytics-hive service to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168)
[11:14:53] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage
[11:15:23] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin: fix insetup role report mapping [puppet] - 10https://gerrit.wikimedia.org/r/1007743 (owner: 10Volans)
[11:16:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9595436 (10MoritzMuehlenhoff)
[11:17:12] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage
[11:17:51] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Set up mailing list ipbe-zh for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9595443 (10Ladsgroup) Due to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the name of mailing list should be wikipedia-zh-ipbe. I create it now.
[11:18:03] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková)
[11:18:03] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:18:28] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2314.codfw.wmnet with OS bullseye
[11:18:30] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2315.codfw.wmnet with OS bullseye
[11:18:32] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2316.codfw.wmnet with OS bullseye
[11:18:35] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2320.codfw.wmnet with OS bullseye
[11:18:37] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2321.codfw.wmnet with OS bullseye
[11:18:40] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2322.codfw.wmnet with OS bullseye
[11:19:30] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková)
[11:19:50] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Set up mailing list ipbe-zh for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9595455 (10Ladsgroup) 05Open→03Resolved Done now: https://lists.wikimedia.org/postorius/lists/wikipedia-zh-ipbe.lists.wikimedia.org  Note that IPs/UA/email address sent to this email a...
[11:20:43] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[11:21:03] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[11:21:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[11:21:59] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[11:22:15] <wikibugs>	 (CR) Btullis: superset: rollout the cache user isolation feature flags everywhere (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850) (owner: Brouberol)
[11:22:37] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloud-private IPs for new cloudnet-devs - taavi@cumin1002"
[11:22:37] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[11:22:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[11:23:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[11:23:12] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:23:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:24:24] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad
[11:24:37] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloud-private IPs for new cloudnet-devs - taavi@cumin1002"
[11:24:37] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:24:42] <wikibugs>	 (PS1) Btullis: Failback hive services to an-coord1003 after restart [dns] - https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168)
[11:24:59] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
[11:25:00] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
[11:25:10] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:25:17] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[11:25:40] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[11:25:46] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on all recursors
[11:25:50] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on all recursors
[11:26:50] <wikibugs>	 (PS1) Clément Goubert: Move 5 eqiad appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074)
[11:28:08] <wikibugs>	 (CR) Dreamy Jazz: throttle: Allow for overriding temp account creation limits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: Kosta Harlan)
[11:28:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:29:18] <wikibugs>	 (CR) Dreamy Jazz: throttle: Allow for overriding temp account creation limits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: Kosta Harlan)
[11:30:02] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[11:30:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:30:58] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[11:31:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[11:32:16] <Dreamy_Jazz>	 !log Re-starting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[11:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[11:33:10] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[11:33:42] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[11:33:48] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:34:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:34:46] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2320.codfw.wmnet with reason: host reimage
[11:34:49] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[11:34:53] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2322.codfw.wmnet with reason: host reimage
[11:34:59] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2321.codfw.wmnet with reason: host reimage
[11:35:07] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2315.codfw.wmnet with reason: host reimage
[11:35:10] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2316.codfw.wmnet with reason: host reimage
[11:35:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:35:15] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2314.codfw.wmnet with reason: host reimage
[11:35:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[11:37:25] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2320.codfw.wmnet with reason: host reimage
[11:38:30] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[11:39:04] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[11:39:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[11:39:10] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[11:39:30] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2316.codfw.wmnet with reason: host reimage
[11:39:35] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[11:39:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[11:40:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[11:40:12] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:40:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:40:37] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:40:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[11:41:18] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[11:42:04] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2322.codfw.wmnet with reason: host reimage
[11:42:37] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2024-03-04-113412-production [deployment-charts] - https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773)
[11:42:52] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bookworm
[11:43:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bookworm
[11:43:46] <wikibugs>	 (CR) Btullis: [C: +1] Remove an-tool1005 and associated hieradata [puppet] - https://gerrit.wikimedia.org/r/1007857 (https://phabricator.wikimedia.org/T358706) (owner: Brouberol)
[11:44:14] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2314.codfw.wmnet with reason: host reimage
[11:44:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on kubernetes2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:44] <wikibugs>	 (PS2) Kosta Harlan: throttle: Allow for overriding temp account creation limits [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777)
[11:45:06] <wikibugs>	 (CR) Kosta Harlan: throttle: Allow for overriding temp account creation limits (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: Kosta Harlan)
[11:45:17] <wikibugs>	 (CR) Btullis: [C: +1] "Thanks, looks good." [puppet] - https://gerrit.wikimedia.org/r/1008406 (owner: Muehlenhoff)
[11:47:37] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2321.codfw.wmnet with reason: host reimage
[11:47:54] <claime>	 !log Disabling puppet on C:profile::firewall::log::ferm to deploy 1005978 - T354855
[11:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:57] <stashbot>	 T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855
[11:48:37] <claime>	 hmm no I'm gonna wait until my reimages are done, or it'll mess with them
[11:48:56] <claime>	 !log Disregard previous puppet disable message, waiting a bit T354855
[11:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:28] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[11:49:57] <wikibugs>	 (PS3) Majavah: Add some new networks for WMCS OVS testing [puppet] - https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761)
[11:50:02] <wikibugs>	 (PS1) Majavah: hieradata: lock down node-exporter on codfw1dev net-ovs [puppet] - https://gerrit.wikimedia.org/r/1008421
[11:50:07] <wikibugs>	 (PS1) Majavah: O:wmcs: codfw1dev: net_ovs: add base neutron config [puppet] - https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761)
[11:50:12] <wikibugs>	 (PS5) Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675)
[11:50:17] <wikibugs>	 (PS5) Btullis: Allow systemd::timer::job to send from a custom address [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675)
[11:50:25] <wikibugs>	 (PS5) Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675)
[11:50:40] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2315.codfw.wmnet with reason: host reimage
[11:51:06] <wikibugs>	 (CR) Btullis: Change the default systemd timer email source to noreply@wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[11:51:50] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: lock down node-exporter on codfw1dev net-ovs [puppet] - https://gerrit.wikimedia.org/r/1008421 (owner: Majavah)
[11:52:24] <wikibugs>	 (CR) Btullis: Allow systemd::timer::job to send from a custom address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[11:56:47] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:57:26] <claime>	 ^known, due to reimages in progress, I have a patch for this issue queued so I'm leaving it alone to see if the patch fixes it once the reimages are done
[11:59:13] <wikibugs>	 (CR) Stevemunene: [C: +1] "lgtm!" [dns] - https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168) (owner: Btullis)
[11:59:51] <wikibugs>	 (CR) Btullis: [C: +2] Failover the analytics-hive service to an-coord1004 [dns] - https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168) (owner: Btullis)
[12:00:40] <wikibugs>	 (CR) Stevemunene: [C: +1] Failback hive services to an-coord1003 after restart [dns] - https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: Btullis)
[12:01:53] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2320.codfw.wmnet with OS bullseye
[12:03:21] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2316.codfw.wmnet with OS bullseye
[12:05:58] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2322.codfw.wmnet with OS bullseye
[12:08:31] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2314.codfw.wmnet with OS bullseye
[12:10:50] <wikibugs>	 (CR) Brouberol: [C: +1] Failback hive services to an-coord1003 after restart [dns] - https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: Btullis)
[12:11:38] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2315.codfw.wmnet with OS bullseye
[12:12:28] <wikibugs>	 (PS3) Brouberol: superset: rollout the cache user isolation feature flags everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850)
[12:13:06] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2321.codfw.wmnet with OS bullseye
[12:13:14] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for 6 hosts
[12:13:21] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
[12:14:07] <wikibugs>	 (PS1) Btullis: Create the /usr/share/binfmts directory to fix JRE error [docker-images/production-images] - https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866)
[12:14:47] <claime>	 !log Running homer 'cr*codfw*' commit 'T351074'
[12:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:51] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:15:55] <wikibugs>	 (CR) Brouberol: [C: +1] "Thanks!" [labs/private] - https://gerrit.wikimedia.org/r/1008408 (owner: Muehlenhoff)
[12:16:08] <wikibugs>	 (CR) Brouberol: [C: +2] Remove an-tool1005 and associated hieradata [puppet] - https://gerrit.wikimedia.org/r/1007857 (https://phabricator.wikimedia.org/T358706) (owner: Brouberol)
[12:16:13] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[12:17:04] <wikibugs>	 (CR) Majavah: [C: +2] O:wmcs: codfw1dev: net_ovs: add base neutron config [puppet] - https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[12:17:33] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. When merged, please reassign T358866 to me. Then I can push a revert when I upgrade our Java 8 backports in the future (for th" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866) (owner: Btullis)
[12:18:27] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Fix location of dummy keytab for an-airflow1007 [labs/private] - https://gerrit.wikimedia.org/r/1008408 (owner: Muehlenhoff)
[12:19:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] Remove cloud_private_v4_set from cloudgw nftables definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: Cathal Mooney)
[12:19:14] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Create the /usr/share/binfmts directory to fix JRE error [docker-images/production-images] - https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866) (owner: Btullis)
[12:21:31] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2314.codfw.wmnet|mw2315.codfw.wmnet|mw2316.codfw.wmnet|mw2320.codfw.wmnet|mw2321.codfw.wmnet|mw2322.codfw.wmnet),cluster=kubernetes,service=kubesvc
[12:22:02] <claime>	 !log Uncordoning mw2314.codfw.wmnet mw2315.codfw.wmnet mw2316.codfw.wmnet mw2320.codfw.wmnet mw2321.codfw.wmnet mw2322.codfw.wmnet - T351074
[12:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:05] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:22:51] <claime>	 !log Disabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855
[12:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:55] <stashbot>	 T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855
[12:23:53] <wikibugs>	 (PS1) Majavah: openstack: neutron: fix ordering [puppet] - https://gerrit.wikimedia.org/r/1008438
[12:25:05] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1008438 (owner: Majavah)
[12:25:20] <wikibugs>	 (CR) Brouberol: [C: +2] superset: rollout the cache user isolation feature flags everywhere (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850) (owner: Brouberol)
[12:27:05] <wikibugs>	 (CR) Clément Goubert: [C: +2] ferm: Check ferm.service status in ferm_status.py [puppet] - https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: Clément Goubert)
[12:27:42] <wikibugs>	 (CR) Btullis: Allow systemd::timer::job to send from a custom address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[12:27:48] <wikibugs>	 (CR) Majavah: [C: +2] openstack: neutron: fix ordering [puppet] - https://gerrit.wikimedia.org/r/1008438 (owner: Majavah)
[12:28:32] <claime>	 !log Enabling puppet on kubernetes2019 to test new ferm_status.py - T354855
[12:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:35] <stashbot>	 T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855
[12:30:53] <claime>	 !log Enabling puppet on mw2322 to test new ferm_status.py - T354855
[12:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "the policy can be improved for better network security. I will make a proposal soon." [puppet] - https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney)
[12:33:11] <claime>	 !log Enabling puppet on puppetboard2003 to test new ferm_status.py - T354855
[12:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on kubernetes2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:35:27] <claime>	 ^this actually means it resolved, because another systemd unit of another type is failing
[12:35:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.decommission for hosts etherpad1003.eqiad.wmnet
[12:36:24] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[12:36:54] <claime>	 looks like the patched ferm_status.py works correctly, puppet doesn't restart the service on every run, the status looks good, re-enabling puppet fleet-wide moritzm 
[12:36:56] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[12:37:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[12:37:44] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[12:38:06] <claime>	 !log Re-enabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855
[12:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:09] <stashbot>	 T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855
[12:39:24] <moritzm>	 claime: great, sgtm
[12:41:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:43:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: etherpad1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002"
[12:43:54] <wikibugs>	 (CR) Kamila Součková: [C: +1] Move 5 eqiad appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[12:44:54] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9595712 (Clement_Goubert) Open→Resolved a:Clement_Goubert Deployed, puppet now restarts ferm.service if the sy...
[12:45:15] <claime>	 !log Depooling mw1350.eqiad.wmnet,mw1351.eqiad.wmnet,mw1352.eqiad.wmnet,mw1353.eqiad.wmnet,mw1354.eqiad.wmnet for move to kubernetes - T351074
[12:45:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: etherpad1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002"
[12:45:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:45:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts etherpad1003.eqiad.wmnet
[12:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:20] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[12:45:35] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9595718 (fnegri) @bking thanks for having a look! No rush really, I was...
[12:45:44] <wikibugs>	 (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1008429 (owner: L10n-bot)
[12:47:48] <wikibugs>	 (CR) Clément Goubert: [C: +2] Move 5 eqiad appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[12:50:24] <wikibugs>	 (PS1) Jelto: site.pp: remove old etherpad1003 host [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047)
[12:50:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[12:52:14] <wikibugs>	 (CR) Muehlenhoff: Change the default systemd timer email source to noreply@wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[12:52:56] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1350.eqiad.wmnet with OS bullseye
[12:52:59] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1351.eqiad.wmnet with OS bullseye
[12:53:02] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1352.eqiad.wmnet with OS bullseye
[12:53:04] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1353.eqiad.wmnet with OS bullseye
[12:53:07] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1354.eqiad.wmnet with OS bullseye
[12:54:54] <wikibugs>	 (PS1) EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041)
[12:55:12] <wikibugs>	 (CR) Muehlenhoff: "I think you missed hieradata/hosts/etherpad1003.yaml?" [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: Jelto)
[12:55:54] <wikibugs>	 (PS2) Jelto: site.pp: remove old etherpad1003 host [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047)
[12:56:39] <wikibugs>	 (CR) Jelto: "yes thanks! I removed the file in patch set 2." [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: Jelto)
[12:56:47] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:00:15] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:00:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] site.pp: remove old etherpad1003 host [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: Jelto)
[13:01:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 41.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:01:38] <wikibugs>	 (CR) Jelto: [C: +2] site.pp: remove old etherpad1003 host [puppet] - https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: Jelto)
[13:04:13] <claime>	 AppserversUnreachable is transient due to reimages in progress
[13:05:01] <claime>	 akosiaris: do we want to run parsoid hotter than web/api deployments? if we do, we should adapt the alert a bit
[13:05:33] <akosiaris>	 claime: no, I don't think we do
[13:06:10] <akosiaris>	 but I am in the process of migrating this week, so I think we might want to handle this alert a bit differently this week
[13:06:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:06:21] <claime>	 Right, that's why I was asking :)
[13:06:35] <wikibugs>	 (PS1) Gmodena: eventstreams: change default num_workers to 0. [deployment-charts] - https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051)
[13:06:40] <akosiaris>	 I'll add more hosts and capacity today
[13:06:46] <claime>	 ack
[13:06:52] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage
[13:06:59] <wikibugs>	 (PS1) EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041)
[13:07:07] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1352.eqiad.wmnet with reason: host reimage
[13:07:13] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1353.eqiad.wmnet with reason: host reimage
[13:07:27] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage
[13:07:29] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1354.eqiad.wmnet with reason: host reimage
[13:09:53] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage
[13:10:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:10:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:10:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9595821 (10MoritzMuehlenhoff) So, in order to move over the access from the existing kvc-wikimf account to kcvelaga we would need to do the following:...
[13:12:07] <wikibugs>	 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595828 (10Jelto)
[13:12:17] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1352.eqiad.wmnet with reason: host reimage
[13:12:26] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:12:28] <wikibugs>	 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595829 (10Jelto)
[13:12:56] <moritzm>	 !log installing jqueryui security updates
[13:12:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:14:12] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:14:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:42] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:14:43] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage
[13:15:27] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:15:31] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.dns.netbox
[13:15:54] <claime>	 ^The ferm.service error popping up is expected, it should resolve itself with the next puppet run
[13:16:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:17:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:17:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:17:23] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:17:31] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1353.eqiad.wmnet with reason: host reimage
[13:17:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[13:18:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:19:16] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:19:49] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:20:10] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1354.eqiad.wmnet with reason: host reimage
[13:20:15] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:20:35] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney)
[13:21:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[13:22:03] <wikibugs>	 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595841 (10Jelto) 05Open→03Resolved >>! In T316421#9590106, @dcausse wrote: > Since the upgrade I believe that we are affected...
[13:22:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:23:03] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:26] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:24:24] <wikibugs>	 (03PS6) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675)
[13:24:26] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,name=parse102.*,dc=eqiad
[13:24:29] <wikibugs>	 (03PS6) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675)
[13:24:34] <wikibugs>	 (03PS6) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675)
[13:25:00] <akosiaris>	 !log depool parse102.* from parsoid-php in eqiad T358752
[13:25:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:04] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[13:27:06] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,name=parse101[012],dc=eqiad
[13:27:07] <wikibugs>	 (03PS1) 10Cathal Mooney: Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658)
[13:27:20] <wikibugs>	 (03PS1) 10Btullis: Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228)
[13:28:07] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,name=parse101[012].eqiad.wmnet,dc=eqiad
[13:28:13] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:28:27] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: (no justification provided)
[13:28:53] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:28:57] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1350.eqiad.wmnet with OS bullseye
[13:29:14] <jnuche>	 ^^ test deploy to new host, forgot to add message, please ignore
[13:30:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis)
[13:31:07] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595872 (10BTullis) a:03BTullis
[13:31:22] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595870 (10BTullis) Moving this into our current milestone, as we are currently working on tes...
[13:31:28] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1352.eqiad.wmnet with OS bullseye
[13:31:38] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595880 (10BTullis)
[13:32:06] <wikibugs>	 (03CR) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis)
[13:32:14] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis)
[13:33:01] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: (no justification provided) (duration: 04m 33s)
[13:33:35] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1351.eqiad.wmnet with OS bullseye
[13:33:48] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2024.03.04 - 2024.03.24): Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9595890 (10Gehel)
[13:33:54] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:34:42] <wikibugs>	 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9595893 (10Gehel)
[13:35:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658) (owner: 10Cathal Mooney)
[13:36:39] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1353.eqiad.wmnet with OS bullseye
[13:37:29] <wikibugs>	 (03PS2) 10Cathal Mooney: Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658)
[13:38:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Move 5 eqiad parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752)
[13:39:18] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1354.eqiad.wmnet with OS bullseye
[13:39:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Move 5 eqiad parsoid servers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[13:41:31] <claime>	 !log Running homer 'cr*eqiad*' commit 'T351074'
[13:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:36] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:46:21] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:47:21] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[13:47:36] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1350.eqiad.wmnet|mw1351.eqiad.wmnet|mw1352.eqiad.wmnet|mw1353.eqiad.wmnet|mw1354.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[13:47:55] <claime>	 !log Uncordoning mw1351.eqiad.wmnet mw1352.eqiad.wmnet mw1353.eqiad.wmnet mw1354.eqiad.wmnet - T351074
[13:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:59] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:48:03] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:03] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:49:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:49:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58343 and previous config saved to /var/cache/conftool/dbconfig/20240304-134922-arnaudb.json
[13:49:26] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:49:59] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008071
[13:50:28] <wikibugs>	 07Puppet, 06SRE, 10Observability-Alerting, 10Puppet-Infrastructure: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720#9595942 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Optimistically resolving since we've moved to prometheus-based alert...
[13:50:54] <wikibugs>	 (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis)
[13:51:07] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: re-image
[13:51:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: re-image
[13:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[13:51:46] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9595953 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=63abb5d8-03a7-48ae-abcc-214900c13c28) set by akosiaris@cumin1002 for 2:00:0...
[13:54:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58344 and previous config saved to /var/cache/conftool/dbconfig/20240304-135446-arnaudb.json
[13:54:51] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:56:04] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Move 8 eqiad parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:02:51] <wikibugs>	 (03PS3) 10Ssingh: dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054)
[14:04:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1568/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[14:04:40] <wikibugs>	 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9596013 (10ABran-WMF)
[14:04:53] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9596015 (10bking)  > @bking what if we release spicerack with the change...
[14:05:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "duh, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris)
[14:09:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58345 and previous config saved to /var/cache/conftool/dbconfig/20240304-140952-arnaudb.json
[14:11:47] <icinga-wm_>	 PROBLEM - MariaDB Replica SQL: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table nlwiki.recentchanges: Index for table recentchanges is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1155-bin.001893, end_log_pos 431898912 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depoolin
[14:11:47] <icinga-wm_>	 ca
[14:12:30] <rxy>	 Any gerrit admin around? Could you please add me to `Trusted-Contributors` (2021f25e7515187a81d51f8fe14dd6f25617cce0) ?
[14:12:52] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1010.eqiad.wmnet with OS bullseye
[14:13:06] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye
[14:13:30] <wikibugs>	 (03CR) 10Muehlenhoff: LDAPBackend: Implement limit checks for UID (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede)
[14:16:25] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[14:16:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2158.codfw.wmnet with reason: Silence for maintenance T356240
[14:17:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9596058 (10MoritzMuehlenhoff)
[14:17:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2158.codfw.wmnet with reason: Silence for maintenance T356240
[14:17:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58346 and previous config saved to /var/cache/conftool/dbconfig/20240304-141730-arnaudb.json
[14:17:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2158.codfw.wmnet
[14:18:39] <wikibugs>	 06SRE, 10ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9596066 (10Volans) @wiki_willy yes, if we go that way then I guess a separate tab on the accounting sheet with both asset tags (chassis and motherboard), compiled only for the hosts that have had th...
[14:19:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'For maint', diff saved to https://phabricator.wikimedia.org/P58347 and previous config saved to /var/cache/conftool/dbconfig/20240304-141913-ladsgroup.json
[14:19:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2156.codfw.wmnet onto db2194.codfw.wmnet
[14:20:01] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:21:49] <moritzm>	 !log installing glib2.0 security updates
[14:22:06] <wikibugs>	 (03CR) 10Majavah: LDAPBackend: Implement limit checks for UID (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede)
[14:22:42] <wikibugs>	 (03CR) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney)
[14:22:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2158.codfw.wmnet
[14:22:58] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596083 (10Jhancock.wm) @ABran-WMF I'll be here for that.
[14:23:40] <wikibugs>	 (03CR) 10Muehlenhoff: LDAPBackend: Implement limit checks for UID (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede)
[14:24:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58348 and previous config saved to /var/cache/conftool/dbconfig/20240304-142459-arnaudb.json
[14:25:31] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage
[14:27:30] <wikibugs>	 (03CR) 10Btullis: elastic: add elastic2088-2109 to production role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[14:27:36] <sukhe>	 !log disable puppet on A:lvs to merge CR 1007879
[14:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:49] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage
[14:28:07] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596105 (10ABran-WMF) I'll depool the node around 15:55 UTC then and will wait for your confirmation to repool it
[14:28:52] <sukhe>	 !log reprepro -C component/pybal include bullseye-wikimedia pybal_1.15.14_amd64.changes
[14:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:12] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh)
[14:29:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58349 and previous config saved to /var/cache/conftool/dbconfig/20240304-142921-arnaudb.json
[14:29:32] <wikibugs>	 (03PS2) 10Ssingh: Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879
[14:29:52] <wikibugs>	 (03CR) 10Ssingh: Revert "pybal: do not install from component" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh)
[14:30:04] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney)
[14:30:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Failback hive services to an-coord1003 after restart [dns] - 10https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[14:30:20] <taavi>	 !log manually update PCC facts from puppetserver1001 to pick up cloudnet2007/8-dev os upgrade
[14:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh)
[14:30:56] <wikibugs>	 (03Abandoned) 10Reedy: captchaloop: Generate old and new captchas [puppet] - 10https://gerrit.wikimedia.org/r/990715 (owner: 10Reedy)
[14:30:58] <icinga-wm_>	 RECOVERY - MariaDB Replica SQL: s2 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:32:08] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:33:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596119 (10KCVelaga_WMF) @MoritzMuehlenhoff When I change my email to wikimedia.org for the developer account, I am encountering a...
[14:34:26] <wikibugs>	 (03PS4) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878)
[14:35:45] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney)
[14:36:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: multiversion-base: rebuild to include new php-luasandbox [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008457 (https://phabricator.wikimedia.org/T358867)
[14:36:46] <wikibugs>	 (03CR) 10Bking: elastic: add elastic2088-2109 to production role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[14:37:01] <wikibugs>	 (03PS5) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878)
[14:37:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] multiversion-base: rebuild to include new php-luasandbox [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008457 (https://phabricator.wikimedia.org/T358867) (owner: 10Giuseppe Lavagetto)
[14:38:03] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:10] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596140 (10MoritzMuehlenhoff) @KCVelaga_WMF : That is expected, your kcvelaga account isn't yet part of the cn=wmf LDAP group, it...
[14:38:14] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[14:40:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58350 and previous config saved to /var/cache/conftool/dbconfig/20240304-144005-arnaudb.json
[14:40:07] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[14:40:10] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:40:21] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596151 (10KCVelaga_WMF) @MoritzMuehlenhoff Ah okay! Thanks for clarifying. Also, to answer your second question, all of my work i...
[14:40:22] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[14:43:09] <sukhe>	 !log sudo cumin -b1 -s 30 "A:lvs and not P{lvs2014*}" "run-puppet-agent --enable 'merging CR 1007879'"
[14:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[14:43:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[14:43:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58351 and previous config saved to /var/cache/conftool/dbconfig/20240304-144344-arnaudb.json
[14:44:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658) (owner: 10Cathal Mooney)
[14:44:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw1453:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58352 and previous config saved to /var/cache/conftool/dbconfig/20240304-144426-arnaudb.json
[14:45:17] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1010.eqiad.wmnet with OS bullseye
[14:45:31] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye comp...
[14:45:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Pass firewall range in profile::firewall syntax for remaining Airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1008406 (owner: 10Muehlenhoff)
[14:46:14] <wikibugs>	 (03PS2) 10Muehlenhoff: airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407
[14:47:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff)
[14:48:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58353 and previous config saved to /var/cache/conftool/dbconfig/20240304-144844-arnaudb.json
[14:48:48] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:50:18] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on stat1005.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010
[14:50:32] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on stat1005.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010
[14:50:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on stat1010.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010
[14:50:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on stat1010.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010
[14:53:30] <wikibugs>	 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9596240 (10BTullis) The two servers have been shut down and are ready for the hardware swap.
[14:53:51] <wikibugs>	 (03PS3) 10Muehlenhoff: airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407
[14:53:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596237 (10cmooney) >>! In T358658#9596140, @MoritzMuehlenhoff wrote: > @KCVelaga_WMF : That is expected, your kcvelaga account is...
[14:54:16] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1453 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:54:31] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[14:58:03] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:19] <wikibugs>	 (03PS1) 10Majavah: openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373)
[14:58:24] <wikibugs>	 (03PS1) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373)
[14:59:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58354 and previous config saved to /var/cache/conftool/dbconfig/20240304-145931-arnaudb.json
[15:00:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff)
[15:00:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance
[15:00:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance
[15:01:46] <wikibugs>	 (03PS2) 10Majavah: openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373)
[15:01:51] <wikibugs>	 (03PS2) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373)
[15:03:32] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1570/co" [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[15:03:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58356 and previous config saved to /var/cache/conftool/dbconfig/20240304-150350-arnaudb.json
[15:04:20] <_joe_>	 !log installing php-luasandbox update on mediawiki canaries T353414
[15:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:23] <stashbot>	 T353414: Build and deploy LuaSandbox 4.1.2 - https://phabricator.wikimedia.org/T353414
[15:09:15] <wikibugs>	 (03PS1) 10Bking: flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879)
[15:09:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw1453:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:17] <wikibugs>	 (03CR) 10Stevemunene: "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[15:13:22] <wikibugs>	 (03PS3) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373)
[15:13:31] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking)
[15:13:35] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[15:13:41] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[15:14:29] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560)
[15:14:34] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560)
[15:14:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58357 and previous config saved to /var/cache/conftool/dbconfig/20240304-151436-arnaudb.json
[15:14:39] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560)
[15:14:47] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560)
[15:14:55] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560)
[15:15:03] <wikibugs>	 (03PS3) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560)
[15:15:36] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking)
[15:17:40] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans)
[15:17:56] <wikibugs>	 (CR) Brouberol: elastic: add elastic2088-2109 to production role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[15:18:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58358 and previous config saved to /var/cache/conftool/dbconfig/20240304-151856-arnaudb.json
[15:19:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487
[15:20:30] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:openstack: rabbitmq: remove cinder-backups term [puppet] - 10https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: 10Majavah)
[15:20:53] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487 (owner: 10Effie Mouzeli)
[15:21:05] <wikibugs>	 (03PS9) 10Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998419
[15:22:31] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487 (owner: 10Effie Mouzeli)
[15:22:38] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1573/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah)
[15:23:30] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[15:23:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[15:24:09] <wikibugs>	 (03CR) 10Dzahn: "the removal from site.pp needs to happen after the decom cookbook finished. but at the same time it will warn you about remaining strings " [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney)
[15:24:15] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1453 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:25:34] <wikibugs>	 (03Abandoned) 10Dzahn: site: remove etherpad on bullseye machine [puppet] - 10https://gerrit.wikimedia.org/r/1003075 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn)
[15:26:05] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: rabbitmq: use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah)
[15:27:37] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff)
[15:29:12] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1011.eqiad.wmnet with OS bullseye
[15:29:26] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye
[15:30:36] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1037.eqiad.wmnet with reason: Bootstrapping — T354560
[15:30:43] <stashbot>	 T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560
[15:30:50] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1037.eqiad.wmnet with reason: Bootstrapping — T354560
[15:31:24] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: adjust resources (cpu) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489
[15:32:15] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811#9596480 (MoritzMuehlenhoff) p:Triage→Medium
[15:34:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58359 and previous config saved to /var/cache/conftool/dbconfig/20240304-153403-arnaudb.json
[15:34:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[15:34:07] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[15:34:18] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489
[15:34:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[15:34:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58360 and previous config saved to /var/cache/conftool/dbconfig/20240304-153425-arnaudb.json
[15:35:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 (owner: 10Effie Mouzeli)
[15:37:02] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 (owner: 10Effie Mouzeli)
[15:38:11] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[15:38:18] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[15:39:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58361 and previous config saved to /var/cache/conftool/dbconfig/20240304-153933-arnaudb.json
[15:39:37] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[15:40:21] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[15:40:44] <claime>	 jouncebot: nowandnext
[15:40:44] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[15:40:44] <jouncebot>	 In 0 hour(s) and 49 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1630)
[15:40:54] <claime>	 a'ight *cracks knuckles*
[15:41:16] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[15:42:06] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage
[15:43:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:43:10] <wikibugs>	 (03CR) 10Bking: [C: 03+2] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking)
[15:43:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:43:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[15:43:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[15:43:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2132.codfw.wmnet with reason: Silence for maintenance
[15:43:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[15:44:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2132.codfw.wmnet with reason: Silence for maintenance
[15:44:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[15:44:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[15:44:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage
[15:44:52] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596555 (10ABran-WMF) server downtimed
[15:45:28] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking)
[15:46:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[15:46:45] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:47:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:49:04] <sukhe>	 ^ bunch of parse and elastic failures
[15:51:45] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:52:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (4) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:52:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (12) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:52:50] <wikibugs>	 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596598 (10Jhancock.wm) it's been swapped.
[15:52:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2156.codfw.wmnet onto db2194.codfw.wmnet
[15:53:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 55% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1006527 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert)
[15:54:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58362 and previous config saved to /var/cache/conftool/dbconfig/20240304-155439-arnaudb.json
[15:56:14] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse200[1-5].codfw.wmnet
[15:56:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2124.codfw.wmnet with reason: Silence for maintenance T356240
[15:57:05] <akosiaris>	 !log depool parse200[1-5] from parsoid for re-imaging. T358752
[15:57:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2124.codfw.wmnet with reason: Silence for maintenance T356240
[15:57:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:08] <stashbot>	 T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[15:57:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:57:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58363 and previous config saved to /var/cache/conftool/dbconfig/20240304-155742-arnaudb.json
[15:57:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2124.codfw.wmnet
[15:58:07] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=parsoid-php,dc=codfw,name=parse200[1-5].codfw.wmnet
[15:58:34] <akosiaris>	 !log repool parse200[1-5] in parsoid. There are 2 canaries in that set, I'll leave them for last. T358752.
[15:58:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:03] <akosiaris>	 !log depool parse2016-parse2020 from parsoid for re-imaging. T358752
[15:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:03] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498
[16:01:17] <wikibugs>	 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070 (10fgiunchedi)
[16:01:56] <wikibugs>	 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9596690 (10fgiunchedi)
[16:02:26] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1011.eqiad.wmnet with OS bullseye
[16:02:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (4) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:03:13] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye comp...
[16:03:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Lacks a why and a what in the commit message." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli)
[16:05:16] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1020.eqiad.wmnet with OS bullseye
[16:05:30] <wikibugs>	 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1020.eqiad.wmnet with OS bullseye
[16:07:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:07:52] <wikibugs>	 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9596740 (10Jhancock.wm) I have this on hand:      - 8 x 300 GB SSD   - 3 x 600 GB SSD   - 3 x 800 GB SSD   - 1 x 1.6 TB SSD  Let me know which set you would like to go with.
[16:08:22] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546)
[16:08:27] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[16:09:34] <wikibugs>	 (03PS2) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546)
[16:09:45] <wikibugs>	 SRE, ops-codfw, DBA, decommission-hardware: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9596752 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[16:09:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58365 and previous config saved to /var/cache/conftool/dbconfig/20240304-160945-arnaudb.json
[16:12:08] <sukhe>	 !log sudo cumin "A:dns-rec" "disable-puppet 'merging CR 1007918'": T347054
[16:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:11] <stashbot>	 T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054
[16:14:57] <wikibugs>	 (03PS1) 10Ladsgroup: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237)
[16:15:05] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Move 5 codfw parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752)
[16:15:39] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse201[6-9].codfw.wmnet
[16:15:46] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse2020.codfw.wmnet
[16:16:24] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic2093-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:16:57] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[16:17:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (14) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:18:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58366 and previous config saved to /var/cache/conftool/dbconfig/20240304-161806-arnaudb.json
[16:18:11] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage
[16:19:00] <wikibugs>	 (03PS3) 10BCornwall: slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606)
[16:19:13] <wikibugs>	 (03CR) 10BCornwall: [V: 03+2 C: 03+2] slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[16:19:22] <wikibugs>	 (03PS5) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606)
[16:19:28] <wikibugs>	 (03CR) 10BCornwall: [V: 03+2 C: 03+2] slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[16:19:32] <wikibugs>	 10SRE-swift-storage: 2024-2025 ms swift capacity - https://phabricator.wikimedia.org/T359077 (10MatthewVernon)
[16:20:36] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage
[16:20:48] <wikibugs>	 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9596886 (10cmooney)
[16:21:22] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=ntp
[16:21:24] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (3) Elasticsearch instance elastic2091-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:21:32] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=ntp
[16:22:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (15) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:22:46] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney)
[16:23:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-update
[16:24:42] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update
[16:24:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58367 and previous config saved to /var/cache/conftool/dbconfig/20240304-162452-arnaudb.json
[16:24:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[16:24:56] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:25:02] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney)
[16:25:08] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance
[16:25:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58368 and previous config saved to /var/cache/conftool/dbconfig/20240304-162514-arnaudb.json
[16:25:18] <wikibugs>	 (PS1) Majavah: conntrackd: fix CLI installation [puppet] - https://gerrit.wikimedia.org/r/1008506
[16:26:03] <wikibugs>	 ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9596941 (fgiunchedi) Thank you @Jhancock.wm ! I'd like to go for the 1x 1.6TB SSD please to be added to the existing SSDs in titan2001
[16:26:27] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney)
[16:27:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1008506 (owner: Majavah)
[16:27:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (14) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:27:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58369 and previous config saved to /var/cache/conftool/dbconfig/20240304-162755-arnaudb.json
[16:28:11] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2171.codfw.wmnet with reason: Silence for maintenance T356240
[16:28:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2171.codfw.wmnet with reason: Silence for maintenance T356240
[16:28:34] <wikibugs>	 (CR) Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/1008506/1574/cloudgw1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/1008506 (owner: Majavah)
[16:28:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2171.codfw.wmnet
[16:29:21] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=ntp
[16:29:27] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=ntp
[16:30:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58370 and previous config saved to /var/cache/conftool/dbconfig/20240304-163002-arnaudb.json
[16:30:05] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1630).
[16:30:08] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:30:22] <wikibugs>	 (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:31:03] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:32:20] <wikibugs>	 SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9596970 (Gehel)
[16:32:27] <wikibugs>	 SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596996 (ABran-WMF) thanks, I've preventively reloaded haproxy.   Everything should be OK
[16:32:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (13) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:32:47] <wikibugs>	 (CR) JHathaway: [C: +1] Allow systemd::timer::job to send from a custom address [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[16:33:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2171.codfw.wmnet
[16:33:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58371 and previous config saved to /var/cache/conftool/dbconfig/20240304-163311-arnaudb.json
[16:33:16] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=authdns-update
[16:33:42] <sukhe>	 !log running dummy authdns-update
[16:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58372 and previous config saved to /var/cache/conftool/dbconfig/20240304-163358-arnaudb.json
[16:34:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=authdns-update
[16:34:40] <sukhe>	 !log running dummy authdns-update
[16:34:41] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:46] <sukhe>	 hmm ok
[16:35:26] <wikibugs>	 (PS2) Dbrant: Move account vanishing contact form to Meta wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536)
[16:37:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "hey, what do you think about this approach? with this scheme, I think we can open holes only for the hosts that need them open." [puppet] - https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: Cathal Mooney)
[16:37:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] airflow: Remove ferm_srange [puppet] - https://gerrit.wikimedia.org/r/1008407 (owner: Muehlenhoff)
[16:37:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (12) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:39:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1020.eqiad.wmnet with OS bullseye
[16:39:17] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597019 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1020.eqiad.wmnet with OS bullseye comp...
[16:39:48] <wikibugs>	 (PS1) Ssingh: hiera: dnsbox: update service_type names [puppet] - https://gerrit.wikimedia.org/r/1008510
[16:40:53] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1021.eqiad.wmnet with OS bullseye
[16:41:07] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597042 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1021.eqiad.wmnet with OS bullseye
[16:41:55] <wikibugs>	 (CR) Clément Goubert: [C: +1] Move 5 codfw parsoid servers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752) (owner: Alexandros Kosiaris)
[16:45:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58373 and previous config saved to /var/cache/conftool/dbconfig/20240304-164508-arnaudb.json
[16:45:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:46:13] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Move 5 codfw parsoid servers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752) (owner: Alexandros Kosiaris)
[16:46:23] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: dnsbox: update service_type names [puppet] - https://gerrit.wikimedia.org/r/1008510 (owner: Ssingh)
[16:47:20] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir4001.ulsfo.wmnet
[16:47:34] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:47:36] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:47:36] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:47:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:47:42] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:47:42] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:47:58] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:48:10] <brett>	 That's me....
[16:48:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58374 and previous config saved to /var/cache/conftool/dbconfig/20240304-164816-arnaudb.json
[16:48:42] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 277818 seconds left:Certificate *.wikipedia.bg valid until 2024-04-13 06:06:54 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:48:42] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 332897 seconds left:Certificate *.wikimania.com valid until 2024-05-25 10:21:04 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:48:58] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 322081 seconds left:Certificate *.wikispecies.net valid until 2024-05-25 08:20:38 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:49:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58375 and previous config saved to /var/cache/conftool/dbconfig/20240304-164903-arnaudb.json
[16:49:19] <wikibugs>	 (PS1) Muehlenhoff: puppetboard: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1008513
[16:49:34] <jan_drewniak>	 Hey SRE, just FYI, I'm doing a scap sync and I got a "Host key verification failed" for parse1021.eqiad.wmnet 
[16:49:53] <Reedy>	 jan_drewniak: probably due to some reinstalls... CC akosiaris ^^
[16:51:35] <akosiaris>	 jan_drewniak: gimme a sec
[16:51:47] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: service=parsoid-php,dc=codfw,name=parse2020.codfw.wmnet
[16:52:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=authdns-ns2
[16:52:50] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-ns2
[16:53:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage
[16:53:44] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=authdns-ns2
[16:53:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-ns2
[16:53:57] <akosiaris>	 jan_drewniak: it should be ok now, you just fell in the time window between the host being in the dsh scap list and it being removed.
[16:54:15] <akosiaris>	 sorry about that, I should have set the host as inactive, not depooled.
[16:54:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw2384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:37] <jan_drewniak>	 akosiaris: np, should I restart the scap sync or let it keep running?
[16:54:41] <jinxer-wm>	 (ConfdResourceFailed) resolved: confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:54:56] <akosiaris>	 jan_drewniak: you can let it keep running, the host is no longer a mediawiki host.
[16:55:31] <wikibugs>	 (CR) JHathaway: [C: +1] puppetboard: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1008513 (owner: Muehlenhoff)
[16:55:43] <jan_drewniak>	 gotcha
[16:55:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:56:15] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage
[16:56:49] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org,service=authdns-ns1
[16:57:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:57:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org,service=authdns-ns1
[16:59:21] <sukhe>	 !log sudo cumin -b1 -s120 "A:dns-rec" "run-puppet-agent --enable 'merging CR 1007918'": finish rolling out confd state management: T347054
[16:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:25] <stashbot>	 T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054
[17:00:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58376 and previous config saved to /var/cache/conftool/dbconfig/20240304-170015-arnaudb.json
[17:03:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58377 and previous config saved to /var/cache/conftool/dbconfig/20240304-170320-arnaudb.json
[17:04:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58378 and previous config saved to /var/cache/conftool/dbconfig/20240304-170408-arnaudb.json
[17:05:22] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1006516 (https://phabricator.wikimedia.org/T358483) (owner: Majavah)
[17:07:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (7) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:08:02] <wikibugs>	 SRE, Prod-Kubernetes, serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9597218 (RLazarus)
[17:09:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED
[17:10:51] <James_F>	 jouncebot: nowandnext
[17:10:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[17:10:52] <jouncebot>	 In 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800)
[17:10:52] <jouncebot>	 In 0 hour(s) and 49 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800)
[17:10:59] <wikibugs>	 (CR) Jforrester: [C: +2] ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007885 (owner: Jforrester)
[17:11:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED
[17:11:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED
[17:11:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007885 (owner: Jforrester)
[17:12:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (7) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:13:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:13:40] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086 (phaultfinder)
[17:14:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1021.eqiad.wmnet with OS bullseye
[17:14:45] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597298 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1021.eqiad.wmnet with OS bullseye comp...
[17:15:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58379 and previous config saved to /var/cache/conftool/dbconfig/20240304-171521-arnaudb.json
[17:15:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:15:30] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:15:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:15:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58380 and previous config saved to /var/cache/conftool/dbconfig/20240304-171543-arnaudb.json
[17:15:52] <wikibugs>	 (PS3) Jforrester: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster)
[17:15:59] <wikibugs>	 (Merged) jenkins-bot: ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007885 (owner: Jforrester)
[17:16:16] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1022.eqiad.wmnet with OS bullseye
[17:16:32] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597325 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1022.eqiad.wmnet with OS bullseye
[17:16:46] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:16:46] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 244394 seconds left:Certificate wikimedia.is valid until 2024-04-11 10:06:15 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:16:46] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 262273 seconds left:Certificate wikipedia.com valid until 2024-04-05 02:10:51 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:16:46] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 433273 seconds left:Certificate wikipedia.fi valid until 2024-05-03 08:30:14 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:17:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:17:50] <wikibugs>	 (CR) Jforrester: [C: +1] "Looks good to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster)
[17:17:54] <wikibugs>	 SRE, observability: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#9597339 (Krinkle)
[17:18:02] <wikibugs>	 SRE, observability, Grafana: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#9597341 (Krinkle)
[17:18:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:18:26] <wikibugs>	 SRE, ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9597348 (Jhancock.wm) It's been inserted. Lemme know if you need anything else!
[17:18:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9597351 (Marostegui)
[17:18:45] <James_F>	 jan_drewniak: Scap still running?
[17:18:58] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:19:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58381 and previous config saved to /var/cache/conftool/dbconfig/20240304-171913-arnaudb.json
[17:20:16] <jan_drewniak>	 James_F: hey, yeah, still running. It looks like it's produced a few errors but only because of the parse2019.codfw.wmnet depooling
[17:20:25] <James_F>	 Ack.
[17:20:56] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1008501| Bumping portals to master (T128546)]] (duration: 45m 54s)
[17:20:59] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[17:21:01] <James_F>	 Aha.
[17:21:06] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]]
[17:21:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58382 and previous config saved to /var/cache/conftool/dbconfig/20240304-172136-arnaudb.json
[17:21:40] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:21:57] <jan_drewniak>	 James_F: K looks like it's done
[17:22:06] <James_F>	 Yeah, perfect. :-)
[17:22:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:24:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) ferm.service on mw2384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:27:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) resolved: (4) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:29:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage
[17:30:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9597423 (Jclark-ctr) Removed gpu from stat1005  found power plug has changed between 730xd to 740xd.   both servers powered on with no gpu.    opened ticket requesting cable ordered T359089
[17:31:53] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage
[17:34:26] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:34:34] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[17:36:35] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1012.eqiad.wmnet with OS bullseye
[17:36:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58383 and previous config saved to /var/cache/conftool/dbconfig/20240304-173642-arnaudb.json
[17:36:51] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597456 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye
[17:46:46] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:46:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58384 and previous config saved to /var/cache/conftool/dbconfig/20240304-174653-marostegui.json
[17:46:57] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[17:48:30] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:11] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage
[17:49:12] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1022.eqiad.wmnet with OS bullseye
[17:49:27] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597506 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1022.eqiad.wmnet with OS bullseye comp...
[17:51:30] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage
[17:51:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58385 and previous config saved to /var/cache/conftool/dbconfig/20240304-175148-arnaudb.json
[17:52:39] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1023.eqiad.wmnet with OS bullseye
[17:52:55] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597521 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1023.eqiad.wmnet with OS bullseye
[17:59:51] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]] (duration: 38m 44s)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800)
[18:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800).
[18:01:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092 (TJones)
[18:02:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58386 and previous config saved to /var/cache/conftool/dbconfig/20240304-180159-marostegui.json
[18:06:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58387 and previous config saved to /var/cache/conftool/dbconfig/20240304-180655-arnaudb.json
[18:06:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[18:07:03] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:07:11] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[18:07:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58388 and previous config saved to /var/cache/conftool/dbconfig/20240304-180717-arnaudb.json
[18:08:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1023.eqiad.wmnet with reason: host reimage
[18:09:17] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1012.eqiad.wmnet with OS bullseye
[18:09:31] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye comp...
[18:12:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58389 and previous config saved to /var/cache/conftool/dbconfig/20240304-181219-arnaudb.json
[18:12:24] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:16:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED
[18:17:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58390 and previous config saved to /var/cache/conftool/dbconfig/20240304-181705-marostegui.json
[18:23:17] <wikibugs>	 SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597617 (VRiley-WMF) a:VRiley-WMF
[18:24:11] <wikibugs>	 SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597619 (VRiley-WMF) Reseated the power supply cable. Monitored issue and the error has been resolved.
[18:24:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9597620 (Jhancock.wm)
[18:24:28] <wikibugs>	 SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597621 (VRiley-WMF) Open→Resolved
[18:24:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[18:26:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[18:26:07] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1023.eqiad.wmnet with OS bullseye
[18:26:24] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1023.eqiad.wmnet with OS bullseye comp...
[18:26:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1024.eqiad.wmnet with OS bullseye
[18:26:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[18:27:00] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[18:27:01] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597623 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1024.eqiad.wmnet with OS bullseye
[18:27:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58391 and previous config saved to /var/cache/conftool/dbconfig/20240304-182726-arnaudb.json
[18:27:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED
[18:27:59] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED
[18:29:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2036']
[18:29:41] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2036']
[18:32:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58392 and previous config saved to /var/cache/conftool/dbconfig/20240304-183212-marostegui.json
[18:32:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[18:32:18] <stashbot>	 T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[18:32:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[18:40:34] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1024.eqiad.wmnet with OS bullseye
[18:40:49] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597703 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1024.eqiad.wmnet with OS bullseye exec...
[18:42:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58393 and previous config saved to /var/cache/conftool/dbconfig/20240304-184234-arnaudb.json
[18:50:29] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:50:31] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:50:31] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:50:37] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:50:37] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:50:51] <icinga-wm_>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:57:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58394 and previous config saved to /var/cache/conftool/dbconfig/20240304-185740-arnaudb.json
[18:57:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[18:57:44] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:57:45] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[18:59:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:00:09] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/analytics_product@a076d5c]: (no justification provided)
[19:00:18] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/analytics_product@a076d5c]: (no justification provided) (duration: 00m 09s)
[19:00:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[19:00:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[19:03:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:03:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:06:16] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance
[19:06:29] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance
[19:10:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:10:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:10:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58395 and previous config saved to /var/cache/conftool/dbconfig/20240304-191028-arnaudb.json
[19:10:32] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:16:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58396 and previous config saved to /var/cache/conftool/dbconfig/20240304-191601-arnaudb.json
[19:16:06] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:31:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P58398 and previous config saved to /var/cache/conftool/dbconfig/20240304-193108-arnaudb.json
[19:33:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9597927 (Gehel) Approved as Trey's manager.
[19:33:49] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.04 - 2024.03.24): Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9597928 (Gehel)
[19:46:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P58399 and previous config saved to /var/cache/conftool/dbconfig/20240304-194614-arnaudb.json
[19:56:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2104.codfw.wmnet with OS bullseye
[19:56:48] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@a076d5c]: (no justification provided)
[19:57:14] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@a076d5c]: (no justification provided) (duration: 00m 26s)
[19:58:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2105.codfw.wmnet with OS bullseye
[20:01:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58400 and previous config saved to /var/cache/conftool/dbconfig/20240304-200121-arnaudb.json
[20:01:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:01:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:01:39] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[20:01:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58401 and previous config saved to /var/cache/conftool/dbconfig/20240304-200143-arnaudb.json
[20:02:47] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 423313 seconds left:Certificate wikipedia.fi valid until 2024-05-03 08:30:14 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:02:49] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 234431 seconds left:Certificate wikimedia.is valid until 2024-04-11 10:06:15 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:02:49] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 252311 seconds left:Certificate wikipedia.com valid until 2024-04-05 02:10:51 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:02:51] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 266168 seconds left:Certificate *.wikipedia.bg valid until 2024-04-13 06:06:54 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:02:51] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 321248 seconds left:Certificate *.wikimania.com valid until 2024-05-25 10:21:04 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:03:13] <icinga-wm_>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 310426 seconds left:Certificate *.wikispecies.net valid until 2024-05-25 08:20:38 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir
[20:08:28] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:08:35] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:12:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage
[20:14:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage
[20:15:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage
[20:16:10] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:18:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage
[20:19:34] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1008535
[20:25:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:25:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:28:09] <wikibugs>	 (PS2) Herron: profile::kafka::broker: set cert renewal at 1 month [puppet] - https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870)
[20:28:56] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:29:03] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:31:01] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:31:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:33:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:33:11] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:33:38] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5025.eqsin.wmnet
[20:34:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905
[20:34:33] <stashbot>	 T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905
[20:34:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905
[20:37:39] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:37:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:38:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2104.codfw.wmnet with OS bullseye
[20:41:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2105.codfw.wmnet with OS bullseye
[20:41:09] <wikibugs>	 (PS1) Dzahn: ci_test: add profile::ci::website to allow deployments [puppet] - https://gerrit.wikimedia.org/r/1008539
[20:41:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:41:52] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:46:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:46:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:47:44] <wikibugs>	 (PS3) Bking: elastic: move elastic2107 and 2108 back to insetup [puppet] - https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878)
[20:49:03] <wikibugs>	 (PS2) Dzahn: ci_test: add profile::ci::website to allow deployments [puppet] - https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237)
[20:50:02] <wikibugs>	 (CR) Dzahn: [C: +1] elastic: move elastic2107 and 2108 back to insetup [puppet] - https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:50:20] <wikibugs>	 (CR) Bking: [C: +2] elastic: move elastic2107 and 2108 back to insetup [puppet] - https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: Bking)
[20:50:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:50:40] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:52:50] <wikibugs>	 (CR) Dzahn: [C: +2] ci_test: add profile::ci::website to allow deployments [puppet] - https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[20:52:52] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:52:59] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:53:07] <wikibugs>	 (PS10) BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905)
[20:54:17] <wikibugs>	 (PS1) Bartosz Dziewoński: HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810)
[20:55:46] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:55:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:55:58] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:57:38] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:57:50] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:58:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2107.codfw.wmnet with OS bullseye
[20:58:28] <wikibugs>	 (CR) Dzahn: [C: +2] "Info: Applying configuration version '(a39fe81517) Dzahn - ci_test: add profile::ci::website to allow deployments'" [puppet] - https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[20:58:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2108.codfw.wmnet with OS bullseye
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T2100). nyaa~
[21:00:04] <jouncebot>	 houseblaster, dbrant, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:33] <Jdlrobson>	 o/
[21:00:37] <dbrant>	 o/
[21:00:51] <houseblaster>	 hi!
[21:00:52] <MatmaRex>	 hi
[21:02:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58403 and previous config saved to /var/cache/conftool/dbconfig/20240304-210214-arnaudb.json
[21:02:21] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:05:44] <wikibugs>	 (PS1) Dzahn: ci_test: test switching firewall provider back to iptables [puppet] - https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237)
[21:08:34] <cjming>	 hi - i can deploy unless someone is already at it?
[21:09:17] <dbrant>	 cjming: all yours
[21:09:32] <cjming>	 cool - i'll go in order of the queue
[21:10:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster)
[21:11:02] <wikibugs>	 (CR) Dzahn: [C: +2] ci_test: test switching firewall provider back to iptables [puppet] - https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[21:11:45] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster)
[21:12:03] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]]
[21:12:12] <stashbot>	 T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462
[21:14:15] <inflatador>	 !log bking@cumin2002 depool wdqs2007 for T355873
[21:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:18] <stashbot>	 T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873
[21:17:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58404 and previous config saved to /var/cache/conftool/dbconfig/20240304-211721-arnaudb.json
[21:20:23] <cjming>	 if any SREs are around, my terminal seems to be choking on deploying to test servers - should be quick with a simple config change - any suggestions?
[21:24:20] <logmsgbot>	 !log cjming@deploy2002 cjming and houseblaster: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:24:26] <jinxer-wm>	 (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:29] <cjming>	 nvm - finally went thru
[21:24:33] <stashbot>	 T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462
[21:24:39] <cjming>	 houseblaster: up on test servers if you want to check
[21:25:18] <houseblaster>	 My change is working :)
[21:25:24] <cjming>	 cool - syncing
[21:25:26] <logmsgbot>	 !log cjming@deploy2002 cjming and houseblaster: Continuing with sync
[21:28:56] <cjming>	 in case anyone is around, something does seem off/pokey -- syncing also appears to be getting stuck
[21:29:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:29:14] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:32:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58405 and previous config saved to /var/cache/conftool/dbconfig/20240304-213228-arnaudb.json
[21:33:28] <dancy>	 looking...
[21:43:19] <cjming>	 sorry everyone - something is not right -- deployments are choking -- i've been advised to file a ticket and abort the backport window until it's resolved
[21:43:39] <cjming>	 houseblaster: your patch is not fully deployed
[21:46:21] <MatmaRex>	 hmm, thanks for letting us know. i'll schedule my patch for tomorrow then, it probably wouldn't have made it in time anyway
[21:47:19] <houseblaster>	 I do have to go in a minute. Should I reschedule for tomorrow?
[21:47:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58406 and previous config saved to /var/cache/conftool/dbconfig/20240304-214734-arnaudb.json
[21:47:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:47:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:47:51] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[21:47:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58407 and previous config saved to /var/cache/conftool/dbconfig/20240304-214757-arnaudb.json
[21:48:30] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:41] <cjming>	 houseblaster: i actually think your patch might have just finished syncing -- if you can check that it's live in about 5 minutes (php restarts are about 1/2 way thru) -- if it's not, then please reschedule your patch
[21:49:43] <houseblaster>	 Can do.
[21:49:47] <dancy>	 cjming: Based on the helm rollbacks that happened, the change might be partially deployed (fully deployed on bare metal servers, possibly rolled back on k8s pods).
[21:50:38] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s)
[21:50:42] <stashbot>	 T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462
[21:51:05] <cjming>	 dancy: thanks - gtk - ya, my terminal just said the backport failed. Ticket incoming
[21:51:35] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:51:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:54:22] <cjming>	 houseblaster: see my reply above - please reschedule, looks like syncing failed
[21:55:18] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:55:25] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:55:32] <cjming>	 backport window is bust - closing for now
[21:56:08] <cjming>	 !log end of UTC late backport window due to deployment errors
[21:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58408 and previous config saved to /var/cache/conftool/dbconfig/20240304-215626-arnaudb.json
[21:56:30] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:57:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T2200).
[22:00:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:00:24] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:01:17] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[22:02:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (4) confd_prometheus_metrics.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:26] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:06:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:06:49] <wikibugs>	 SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9598535 (wiki_willy) Thanks for confirming, @Volans.  If everyone else is ok with making the correlation on the accounting spreadsheet, my vote is that we go with that route.  Thanks, Willy
[22:09:32] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:09:38] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:11:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P58409 and previous config saved to /var/cache/conftool/dbconfig/20240304-221132-arnaudb.json
[22:11:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:11:41] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:19:14] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2107.codfw.wmnet with OS bullseye
[22:19:22] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905
[22:19:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905
[22:19:29] <stashbot>	 T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905
[22:19:41] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2108.codfw.wmnet with OS bullseye
[22:25:49] <wikibugs>	 (PS2) Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455)
[22:25:54] <wikibugs>	 (PS2) Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455)
[22:25:59] <wikibugs>	 (PS1) Andrew Bogott: role::puppetserver::cloud_vps_project: remove firewall config [puppet] - https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450)
[22:26:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:26:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P58410 and previous config saved to /var/cache/conftool/dbconfig/20240304-222639-arnaudb.json
[22:31:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:33:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:33:34] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:41:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58411 and previous config saved to /var/cache/conftool/dbconfig/20240304-224145-arnaudb.json
[22:41:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:41:50] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[22:42:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[22:42:49] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[22:43:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[22:43:45] <wikibugs>	 (PS2) Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - https://gerrit.wikimedia.org/r/1008498
[22:43:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[22:43:54] <wikibugs>	 (Merged) jenkins-bot: Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[22:44:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[22:44:34] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[22:45:19] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:45:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[22:45:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58412 and previous config saved to /var/cache/conftool/dbconfig/20240304-224550-arnaudb.json
[22:47:28] <maryum>	 !log deployed patch for T357760
[22:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[22:49:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:59:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:59:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:59:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:59:39] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:01:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[23:01:33] <dancy>	 maryum: Did the security path deployment go smoothly?
[23:01:42] <dancy>	 *patch
[23:01:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[23:02:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[23:03:02] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:03:09] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:05:12] <Dreamy_Jazz>	 I can verify that the security patch worked (if that is what you are asking).
[23:05:58] <dancy>	 In particular I'm wondering if the kubernetes part of the deployment ran smoothly.  There were problems earlier.
[23:08:26] <dancy>	 Hmm.. I'm looking through logstash and it appears that the problem persists.
[23:11:10] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:11:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:13:44] <wikibugs>	 SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9598738 (Volans) Sounds good to me, let me know once done so that I can make the related changes to the report to include those too.
[23:15:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9598740 (odimitrijevic) Approved
[23:16:13] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:16:19] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:17:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9598742 (odimitrijevic) Yes, approved
[23:18:17] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:18:23] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:20:31] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:20:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:24:29] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:24:36] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:28:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58413 and previous config saved to /var/cache/conftool/dbconfig/20240304-232826-arnaudb.json
[23:28:30] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[23:28:42] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:28:49] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:30:46] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:30:53] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:31:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[23:31:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2036.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2038.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2039.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2040.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:09] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.68.0" for 413 hosts
[23:32:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:32:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:35:08] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:35:15] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:37:52] <logmsgbot>	 !log dancy@deploy2002 Locking from deployment [mediawiki]: Mediawiki deployments locked pending resolution of T359114
[23:37:56] <stashbot>	 T359114: Slow and failed deployments - https://phabricator.wikimedia.org/T359114
[23:38:27] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:38:33] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:39:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[23:39:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[23:40:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2040.mgmt.codfw.wmnet with reboot policy FORCED
[23:41:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED
[23:43:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P58414 and previous config saved to /var/cache/conftool/dbconfig/20240304-234332-arnaudb.json
[23:44:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2036.mgmt.codfw.wmnet with reboot policy FORCED
[23:44:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2039.mgmt.codfw.wmnet with reboot policy FORCED
[23:48:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED
[23:48:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2038.mgmt.codfw.wmnet with reboot policy FORCED
[23:50:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2035.codfw.wmnet with OS bookworm
[23:50:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2036.codfw.wmnet with OS bookworm
[23:50:14] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598783 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm
[23:50:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2037.codfw.wmnet with OS bookworm
[23:50:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598784 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2036.codfw.wmnet with OS bookworm
[23:50:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2038.codfw.wmnet with OS bookworm
[23:50:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2039.codfw.wmnet with OS bookworm
[23:50:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2040.codfw.wmnet with OS bookworm
[23:50:23] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598785 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2037.codfw.wmnet with OS bookworm
[23:50:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598786 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2038.codfw.wmnet with OS bookworm
[23:50:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2039.codfw.wmnet with OS bookworm
[23:50:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598788 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2040.codfw.wmnet with OS bookworm
[23:52:53] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123 (bdgreenlee)
[23:58:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P58415 and previous config saved to /var/cache/conftool/dbconfig/20240304-235839-arnaudb.json