[00:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068 [00:38:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068 (owner: 10TrainBranchBot) [01:01:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008068 (owner: 10TrainBranchBot) [01:08:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:29:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:38:03] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:57:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354015)', diff saved to https://phabricator.wikimedia.org/P58325 and previous config saved to /var/cache/conftool/dbconfig/20240304-025750-marostegui.json [02:57:54] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [02:58:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:59] (03PS1) 10KartikMistry: Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773) [03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:12:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58326 and previous config saved to /var/cache/conftool/dbconfig/20240304-031256-marostegui.json [03:28:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58327 and previous config saved to 
/var/cache/conftool/dbconfig/20240304-032803-marostegui.json [03:43:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T354015)', diff saved to https://phabricator.wikimedia.org/P58328 and previous config saved to /var/cache/conftool/dbconfig/20240304-034309-marostegui.json [03:43:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [03:43:16] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [03:43:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [03:43:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58329 and previous config saved to /var/cache/conftool/dbconfig/20240304-034333-marostegui.json [04:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:08:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:08:33] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:22:39] 
RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:01:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594767 (10Marostegui) [06:01:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594768 (10Marostegui) p:05Triage→03Medium [06:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:05:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594770 (10Marostegui) [06:11:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9594787 (10Marostegui) While we verify the ssh account - @SBisson can you approve this request? For analytics-privatedata-users group, @odimitrijevic is this approved? 
[06:12:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2118.codfw.wmnet [06:13:05] (03PS1) 10Marostegui: mariadb: Decommission db2118 [puppet] - 10https://gerrit.wikimedia.org/r/1008126 (https://phabricator.wikimedia.org/T358740) [06:17:48] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [06:19:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2118 [puppet] - 10https://gerrit.wikimedia.org/r/1008126 (https://phabricator.wikimedia.org/T358740) (owner: 10Marostegui) [06:19:56] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2118.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [06:21:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2118.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [06:21:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:21:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2118.codfw.wmnet [06:21:42] 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9594825 (10Marostegui) a:05Marostegui→03None [06:22:02] 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9594829 (10Marostegui) This is ready for #dc-ops [06:27:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P58330 and previous config saved to /var/cache/conftool/dbconfig/20240304-062703-root.json [06:27:39] (03PS1) 10Marostegui: db1186: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1008129 [06:29:05] (03CR) 
10Marostegui: [C: 03+2] db1186: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1008129 (owner: 10Marostegui) [06:35:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58331 and previous config saved to /var/cache/conftool/dbconfig/20240304-063516-root.json [06:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58332 and previous config saved to /var/cache/conftool/dbconfig/20240304-065021-root.json [06:53:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:11] marostegui: OK to deploy cxserver now? [06:55:55] OK. Last log was an hour back, starting it.. [06:56:12] kart_: yeah go for it [06:56:15] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [06:56:21] kart_: that's just some automated schema change, there will be more coming :) [06:56:39] Thanks! 
[06:59:36] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-04-023843-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008119 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [07:01:04] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [07:01:30] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [07:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:05:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58333 and previous config saved to /var/cache/conftool/dbconfig/20240304-070526-root.json [07:05:38] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [07:06:16] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [07:07:53] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [07:08:30] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [07:08:42] !log Updated cxserver to 2024-03-04-023843-production (T350773) [07:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:53] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773 [07:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:18:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58334 and previous config saved to /var/cache/conftool/dbconfig/20240304-072031-root.json [07:32:16] !log installing tar security updates [07:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58335 and previous config saved to /var/cache/conftool/dbconfig/20240304-073536-root.json [07:50:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58336 and previous config saved to /var/cache/conftool/dbconfig/20240304-075041-root.json [07:59:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh) [08:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. 
[08:01:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007743 (owner: 10Volans) [08:05:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58337 and previous config saved to /var/cache/conftool/dbconfig/20240304-080546-root.json [08:34:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1007739 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [08:35:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007740 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [08:52:27] (03CR) 10Stevemunene: [C: 03+1] data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [08:58:51] (03CR) 10Slyngshede: PKI: Switch alerts to use the x509 metric. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:58:56] (03CR) 10Slyngshede: [C: 03+2] PKI: Switch alerts to use the x509 metric. [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:00:03] (03Merged) 10jenkins-bot: PKI: Switch alerts to use the x509 metric. 
[alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:01:50] (03PS3) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [09:08:26] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:33] (03CR) 10CI reject: [V: 04-1] Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [09:13:53] (03PS2) 10Clément Goubert: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 [09:14:44] (03CR) 10Clément Goubert: "Thanks, updated the commit message to remove reference to the kubemasters." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert) [09:22:11] (03Abandoned) 10Mainframe98: GerritBot: Escape change number [puppet] - 10https://gerrit.wikimedia.org/r/1008001 (https://phabricator.wikimedia.org/T358940) (owner: 10Mainframe98) [09:24:01] (03CR) 10Clément Goubert: [C: 03+2] calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert) [09:26:50] (03Merged) 10jenkins-bot: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert) [09:27:27] (03CR) 10Brouberol: [C: 03+1] "Thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:27:35] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:27:38] (03CR) 10Brouberol: [C: 03+1] "Thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:28:01] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:28:56] (03CR) 10DCausse: "Opensearch claims that the 2.0 client supports opensearch 1.0.0 (which should be equivalent to elastic 7.10.2) as long as we don't use fea" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [09:30:12] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:30:32] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:32:43] (03CR) 10Majavah: [C: 03+2] P:openstack: rabbitmq: restrict clustering ports [puppet] - 10https://gerrit.wikimedia.org/r/1007864 (owner: 10Majavah) [09:36:51] (03CR) 10Filippo Giunchedi: [C: 03+2] data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:36:56] (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: fix spark alerts deployment [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:38:30] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [09:38:50] (03Merged) 10jenkins-bot: data-engineering: fix spark alerts deployment [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:38:55] (03Merged) 10jenkins-bot: data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [09:43:26] (SystemdUnitFailed) resolved: 
mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:25] (SystemdUnitFailed) resolved: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58338 and previous config saved to /var/cache/conftool/dbconfig/20240304-102842-marostegui.json [10:28:47] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [10:31:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58339 and previous config saved to /var/cache/conftool/dbconfig/20240304-104348-marostegui.json [10:48:29] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on etherpad1003.eqiad.wmnet with reason: Shutdown and decommission old host [10:48:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on etherpad1003.eqiad.wmnet with reason: Shutdown and decommission old host [10:48:55] 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595343 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=eff489f2-c167-46cc-8ac4-c471b433a777) set by jelto@cum... 
[10:53:23] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bookworm [10:53:46] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bookworm [10:58:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58340 and previous config saved to /var/cache/conftool/dbconfig/20240304-105855-marostegui.json [10:59:15] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1100) [11:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:04:27] (03CR) 10Kamila Součková: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:05:57] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9595411 (10BTullis) >>! In T358763#9591267, @Jclark-ctr wrote: > @BTullis I will be available monday 10am (est) if that works for you Yes please, that's great. I'll notify the users and make sur... 
[11:06:24] (03CR) 10Clément Goubert: [C: 03+1] shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková) [11:08:01] (03CR) 10Effie Mouzeli: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:08:13] !log Depooling mw2314.codfw.wmnet,mw2315.codfw.wmnet,mw2316.codfw.wmnet,mw2320.codfw.wmnet,mw2321.codfw.wmnet,mw2322.codfw.wmnet for move to k8s - T351074 [11:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:08:28] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008070 [11:11:22] (03CR) 10Clément Goubert: [C: 03+2] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:12:01] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [11:12:12] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [11:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T354015)', diff saved to https://phabricator.wikimedia.org/P58341 and previous config saved to /var/cache/conftool/dbconfig/20240304-111401-marostegui.json [11:14:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance 
[11:14:06] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [11:14:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [11:14:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58342 and previous config saved to /var/cache/conftool/dbconfig/20240304-111424-marostegui.json [11:14:35] (03PS1) 10Btullis: Failover the analytics-hive service to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168) [11:14:53] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [11:15:23] (03CR) 10Volans: [C: 03+2] cumin: fix insetup role report mapping [puppet] - 10https://gerrit.wikimedia.org/r/1007743 (owner: 10Volans) [11:16:04] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9595436 (10MoritzMuehlenhoff) [11:17:12] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [11:17:51] 06SRE, 10Wikimedia-Mailing-lists: Set up mailing list ipbe-zh for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9595443 (10Ladsgroup) Due to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the name of mailing list should be wikipedia-zh-ipbe. I create it now. 
[11:18:03] (03CR) 10Kamila Součková: [C: 03+2] shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková) [11:18:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:18:28] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2314.codfw.wmnet with OS bullseye [11:18:30] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2315.codfw.wmnet with OS bullseye [11:18:32] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2316.codfw.wmnet with OS bullseye [11:18:35] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2320.codfw.wmnet with OS bullseye [11:18:37] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2321.codfw.wmnet with OS bullseye [11:18:40] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2322.codfw.wmnet with OS bullseye [11:19:30] (03Merged) 10jenkins-bot: shellbox: fix missing annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006943 (owner: 10Kamila Součková) [11:19:50] 06SRE, 10Wikimedia-Mailing-lists: Set up mailing list ipbe-zh for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9595455 (10Ladsgroup) 05Open→03Resolved Done now: https://lists.wikimedia.org/postorius/lists/wikipedia-zh-ipbe.lists.wikimedia.org Note that IPs/UA/email address sent to this email a... 
[11:20:43] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [11:21:03] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:21:53] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:21:59] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [11:22:15] (03CR) 10Btullis: superset: rollout the cache user isolation feature flags everywhere (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850) (owner: 10Brouberol) [11:22:37] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloud-private IPs for nwe cloudnet-devs - taavi@cumin1002" [11:22:37] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [11:22:43] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [11:23:06] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [11:23:12] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:23:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:24:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [11:24:37] !log taavi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloud-private IPs for nwe cloudnet-devs - taavi@cumin1002" [11:24:37] !log taavi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:24:42] (03PS1) 10Btullis: Failback hive services to an-coord1003 
after restart [dns] - 10https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) [11:24:59] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors [11:25:00] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors [11:25:10] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:25:17] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [11:25:40] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [11:25:46] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on all recursors [11:25:50] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on all recursors [11:26:50] (03PS1) 10Clément Goubert: Move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074) [11:28:08] (03CR) 10Dreamy Jazz: throttle: Allow for overriding temp account creation limits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan) [11:28:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:29:18] (03CR) 10Dreamy Jazz: throttle: Allow for overriding temp account creation limits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan) [11:30:02] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:30:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - 
kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:30:58] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:31:04] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:32:16] !log Re-starting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [11:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:04] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:33:10] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:33:42] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:33:48] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:34:43] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:34:46] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2320.codfw.wmnet with reason: host reimage [11:34:49] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:34:53] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2322.codfw.wmnet with reason: host reimage [11:34:59] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2321.codfw.wmnet with reason: host reimage [11:35:07] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2315.codfw.wmnet with reason: host reimage [11:35:10] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2316.codfw.wmnet with reason: host reimage [11:35:15] 
(MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:35:15] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2314.codfw.wmnet with reason: host reimage [11:35:36] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:37:25] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2320.codfw.wmnet with reason: host reimage [11:38:30] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:39:04] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. [11:39:04] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [11:39:10] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [11:39:30] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2316.codfw.wmnet with reason: host reimage [11:39:35] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [11:39:41] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:40:06] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [11:40:12] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:40:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:40:37] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply 
[11:40:44] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [11:41:18] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [11:42:04] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2322.codfw.wmnet with reason: host reimage [11:42:37] (03PS1) 10KartikMistry: Update cxserver to 2024-03-04-113412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008420 (https://phabricator.wikimedia.org/T350773) [11:42:52] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bookworm [11:43:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bookworm [11:43:46] (03CR) 10Btullis: [C: 03+1] Remove an-tool1005 and associated hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1007857 (https://phabricator.wikimedia.org/T358706) (owner: 10Brouberol) [11:44:14] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2314.codfw.wmnet with reason: host reimage [11:44:26] (SystemdUnitFailed) firing: (2) ferm.service on kubernetes2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:44] (03PS2) 10Kosta Harlan: throttle: Allow for overriding temp account creation limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) [11:45:06] (03CR) 10Kosta Harlan: throttle: Allow for overriding temp account creation limits (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan) [11:45:17] (03CR) 10Btullis: [C: 03+1] "Thanks, looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1008406 (owner: 10Muehlenhoff) [11:47:37] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2321.codfw.wmnet with reason: host reimage [11:47:54] !log Disabling puppet on C:profile::firewall::log::ferm to deploy 1005978 - T354855 [11:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:57] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [11:48:37] hmm no I'm gonna wait until my reimages are done, or it'll mess with them [11:48:56] !log Disregard previous puppet disable message, waiting a bit T354855 [11:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:28] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[11:49:57] (03PS3) 10Majavah: Add some new networks for WMCS OVS testing [puppet] - 10https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761) [11:50:02] (03PS1) 10Majavah: hieradata: lock down node-exporter on codfw1dev net-ovs [puppet] - 10https://gerrit.wikimedia.org/r/1008421 [11:50:07] (03PS1) 10Majavah: O:wmcs: codfw1dev: net_ovs: add base neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761) [11:50:12] (03PS5) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [11:50:17] (03PS5) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [11:50:25] (03PS5) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [11:50:40] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2315.codfw.wmnet with reason: host reimage [11:51:06] (03CR) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [11:51:50] (03CR) 10Majavah: [C: 03+2] hieradata: lock down node-exporter on codfw1dev net-ovs [puppet] - 10https://gerrit.wikimedia.org/r/1008421 (owner: 10Majavah) [11:52:24] (03CR) 10Btullis: Allow systemd::timer::job to send from a custom address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [11:56:47] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been 
started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:57:26] ^known, due to reimages in progress, I have a patch for this issue queued so I'm leaving it alone to see if the patch fixes it once the reimages are done [11:59:13] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [11:59:51] (03CR) 10Btullis: [C: 03+2] Failover the analytics-hive service to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1008414 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:00:40] (03CR) 10Stevemunene: [C: 03+1] Failback hive services to an-coord1003 after restart [dns] - 10https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:01:53] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2320.codfw.wmnet with OS bullseye [12:03:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2316.codfw.wmnet with OS bullseye [12:05:58] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2322.codfw.wmnet with OS bullseye [12:08:31] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2314.codfw.wmnet with OS bullseye [12:10:50] (03CR) 10Brouberol: [C: 03+1] Failback hive services to an-coord1003 after restart [dns] - 10https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:11:38] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2315.codfw.wmnet with OS bullseye [12:12:28] (03PS3) 10Brouberol: superset: rollout the cache user isolation feature flags everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850) [12:13:06] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
mw2321.codfw.wmnet with OS bullseye [12:13:14] !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for 6 hosts [12:13:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts [12:14:07] (03PS1) 10Btullis: Create the /usr/share/binfmts directory to fix JRE error [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866) [12:14:47] !log Running homer 'cr*codfw*' commit 'T351074' [12:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:51] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:15:55] (03CR) 10Brouberol: [C: 03+1] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/1008408 (owner: 10Muehlenhoff) [12:16:08] (03CR) 10Brouberol: [C: 03+2] Remove an-tool1005 and associated hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1007857 (https://phabricator.wikimedia.org/T358706) (owner: 10Brouberol) [12:16:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:17:04] (03CR) 10Majavah: [C: 03+2] O:wmcs: codfw1dev: net_ovs: add base neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1008422 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [12:17:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. When merged, please reassign T358866 to me. 
Then I can push a revert when I upgrade our Java 8 backports in the future (for th" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866) (owner: 10Btullis) [12:18:27] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Fix location of dummy keytab for an-airflow1007 [labs/private] - 10https://gerrit.wikimedia.org/r/1008408 (owner: 10Muehlenhoff) [12:19:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [12:19:14] (03CR) 10Btullis: [V: 03+2 C: 03+2] Create the /usr/share/binfmts directory to fix JRE error [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008428 (https://phabricator.wikimedia.org/T358866) (owner: 10Btullis) [12:21:31] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2314.codfw.wmnet|mw2315.codfw.wmnet|mw2316.codfw.wmnet|mw2320.codfw.wmnet|mw2321.codfw.wmnet|mw2322.codfw.wmnet),cluster=kubernetes,service=kubesvc [12:22:02] !log Uncordoning mw2314.codfw.wmnet mw2315.codfw.wmnet mw2316.codfw.wmnet mw2320.codfw.wmnet mw2321.codfw.wmnet mw2322.codfw.wmnet - T351074 [12:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:05] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:22:51] !log Disabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855 [12:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [12:23:53] (03PS1) 10Majavah: openstack: neutron: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/1008438 [12:25:05] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1008438 (owner: 10Majavah) [12:25:20] (03CR) 10Brouberol: [C: 03+2] superset: rollout the cache user isolation feature flags everywhere (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850) (owner: 10Brouberol) [12:27:05] (03CR) 10Clément Goubert: [C: 03+2] ferm: Check ferm.service status in ferm_status.py [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [12:27:42] (03CR) 10Btullis: Allow systemd::timer::job to send from a custom address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [12:27:48] (03CR) 10Majavah: [C: 03+2] openstack: neutron: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/1008438 (owner: 10Majavah) [12:28:32] !log Enabling puppet on kubernetes2019 to test new ferm_status.py - T354855 [12:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:35] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [12:30:53] !log Enabling puppet on mw2322 to test new ferm_status.py - T354855 [12:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:49] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "the policy can be improved for better network security. I will make a proposal soon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [12:33:11] !log Enabling puppet on puppetboard2003 to test new ferm_status.py - T354855 [12:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:26] (SystemdUnitFailed) firing: (2) ferm.service on kubernetes2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:27] ^this actually means it resolved, because another systemd unit of another type is failing [12:35:48] !log jelto@cumin1002 START - Cookbook sre.hosts.decommission for hosts etherpad1003.eqiad.wmnet [12:36:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:36:54] looks like the patched ferm_status.py works correctly, puppet doesn't restart the service on every run, the status looks good, re-enabling puppet fleet-wide moritzm [12:36:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [12:37:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:37:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [12:38:06] !log Re-enabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855 [12:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:09] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [12:39:24] claime: great, sgtm [12:41:05] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:43:34] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
etherpad1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002" [12:43:54] (03CR) 10Kamila Součková: [C: 03+1] Move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [12:44:54] 06SRE, 06Infrastructure-Foundations, 06serviceops: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9595712 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Deployed, puppet now restarts ferm.service if the sy... [12:45:15] !log Depooling mw1350.eqiad.wmnet,mw1351.eqiad.wmnet,mw1352.eqiad.wmnet,mw1353.eqiad.wmnet,mw1354.eqiad.wmnet for move to kubernetes - T351074 [12:45:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: etherpad1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002" [12:45:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:45:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts etherpad1003.eqiad.wmnet [12:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:20] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:45:35] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9595718 (10fnegri) @bking thanks for having a look! No rush really, I was... [12:45:44] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1008429 (owner: 10L10n-bot) [12:47:48] (03CR) 10Clément Goubert: [C: 03+2] Move 5 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008416 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [12:50:24] (03PS1) 10Jelto: site.pp: remove old etherpad1003 host [puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) [12:50:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [12:52:14] (03CR) 10Muehlenhoff: Change the default systemd timer email source to noreply@wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [12:52:56] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1350.eqiad.wmnet with OS bullseye [12:52:59] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1351.eqiad.wmnet with OS bullseye [12:53:02] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1352.eqiad.wmnet with OS bullseye [12:53:04] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1353.eqiad.wmnet with OS bullseye [12:53:07] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1354.eqiad.wmnet with OS bullseye [12:54:54] (03PS1) 10EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) [12:55:12] (03CR) 10Muehlenhoff: "I think you missed hieradata/hosts/etherpad1003.yaml?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: 10Jelto) [12:55:54] (03PS2) 10Jelto: site.pp: remove old etherpad1003 host [puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) [12:56:39] (03CR) 10Jelto: "yes thanks! I removed the file in patch set 2." [puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: 10Jelto) [12:56:47] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:00:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:00:42] (03CR) 10Muehlenhoff: [C: 03+1] site.pp: remove old etherpad1003 host [puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: 10Jelto) [13:01:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 41.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:01:38] (03CR) 10Jelto: [C: 03+2] site.pp: remove old etherpad1003 host [puppet] - 10https://gerrit.wikimedia.org/r/1008444 (https://phabricator.wikimedia.org/T359047) (owner: 10Jelto) [13:04:13] AppserversUnreachable is transient due to reimages in progress [13:05:01] akosiaris: do we want to run parsoid hotter than web/api deployments? 
if we do, we should adapt the alert a bit [13:05:33] claime: no, I don't think we do [13:06:10] but I am in the process of migrating this week, so I think we might want to handle this alert a bit differently this week [13:06:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:06:21] Right, that's why I was asking :) [13:06:35] (03PS1) 10Gmodena: eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) [13:06:40] I'll add more hosts and capacity today [13:06:46] ack [13:06:52] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [13:06:59] (03PS1) 10EoghanGaffney: [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) [13:07:07] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1352.eqiad.wmnet with reason: host reimage [13:07:13] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1353.eqiad.wmnet with reason: host reimage [13:07:27] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [13:07:29] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1354.eqiad.wmnet with reason: host reimage [13:09:53] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: host reimage [13:10:04] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main'
. [13:10:13] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:10:19] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9595821 (10MoritzMuehlenhoff) So, in order to move over the access from the existing kvc-wikimf account to kcvelaga we would need to do the following:... [13:12:07] 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595828 (10Jelto) [13:12:17] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1352.eqiad.wmnet with reason: host reimage [13:12:26] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:12:28] 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595829 (10Jelto) [13:12:56] !log installing jqueryui security updates [13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:07] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:14:12] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:14:26] (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:42] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[13:14:43] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: host reimage [13:15:27] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:15:31] !log dcaro@cumin1002 START - Cookbook sre.dns.netbox [13:15:54] ^The ferm.service error popping up is expected, it should resolve itself with the next puppet run [13:16:15] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:17:08] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:17:18] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:17:23] !log dcaro@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:31] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1353.eqiad.wmnet with reason: host reimage [13:17:47] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:18:05] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:19:16] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:19:49] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[13:20:10] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1354.eqiad.wmnet with reason: host reimage [13:20:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:20:35] (03PS6) 10Arturo Borrero Gonzalez: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [13:21:04] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [13:22:03] 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9595841 (10Jelto) 05Open→03Resolved >>! In T316421#9590106, @dcausse wrote: > Since the upgrade I believe that we are affected... [13:22:28] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:23:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:26] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:24:24] (03PS6) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [13:24:26] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,name=parse102.*,dc=eqiad [13:24:29] (03PS6) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [13:24:34] (03PS6) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [13:25:00] !log depool parse102.* from parsoid-php in eqiad T358752 [13:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:04] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 [13:27:06] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,name=parse101[012],dc=eqiad [13:27:07] (03PS1) 10Cathal Mooney: Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658) [13:27:20] (03PS1) 10Btullis: Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228) [13:28:07] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: 
service=parsoid-php,name=parse101[012].eqiad.wmnet,dc=eqiad [13:28:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:28:27] !log jnuche@deploy2002 Started deploy [zuul/deploy@bb76c45]: (no justification provided) [13:28:53] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:57] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1350.eqiad.wmnet with OS bullseye [13:29:14] ^^ test deploy to new host, forgot to add message, please ignore [13:30:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:31:07] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595872 (10BTullis) a:03BTullis [13:31:22] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595870 (10BTullis) Moving this into our current milestone, as we are currently working on tes... 
[13:31:28] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1352.eqiad.wmnet with OS bullseye [13:31:38] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9595880 (10BTullis) [13:32:06] (03CR) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:32:14] (03CR) 10Btullis: [C: 03+2] Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:33:01] !log jnuche@deploy2002 Finished deploy [zuul/deploy@bb76c45]: (no justification provided) (duration: 04m 33s) [13:33:35] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1351.eqiad.wmnet with OS bullseye [13:33:48] 07sre-alert-triage, 10Data-Platform-SRE (2024.03.04 - 2024.03.24): Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9595890 (10Gehel) [13:33:54] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:34:42] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9595893 (10Gehel) [13:35:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658) (owner: 10Cathal Mooney) [13:36:39] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1353.eqiad.wmnet with OS bullseye [13:37:29] (03PS2) 10Cathal Mooney: Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 (https://phabricator.wikimedia.org/T358658) [13:38:28] (03PS1) 10Alexandros Kosiaris: Move 5 eqiad parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) [13:39:18] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1354.eqiad.wmnet with OS bullseye [13:39:26] (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:09] (03CR) 10Clément Goubert: [C: 03+1] Move 5 eqiad parsoid servers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [13:41:31] !log Running homer 'cr*eqiad*' commit 'T351074' [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:36] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:46:21] RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:47:21] (03CR) 10ArielGlenn: [C: 03+2] Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [13:47:36] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: 
name=(mw1350.eqiad.wmnet|mw1351.eqiad.wmnet|mw1352.eqiad.wmnet|mw1353.eqiad.wmnet|mw1354.eqiad.wmnet),cluster=kubernetes,service=kubesvc [13:47:55] !log Uncordoning mw1351.eqiad.wmnet mw1352.eqiad.wmnet mw1353.eqiad.wmnet mw1354.eqiad.wmnet - T351074 [13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:59] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:48:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:49:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:49:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58343 and previous config saved to /var/cache/conftool/dbconfig/20240304-134922-arnaudb.json [13:49:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:49:59] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008071 [13:50:28] 07Puppet, 06SRE, 10Observability-Alerting, 10Puppet-Infrastructure: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720#9595942 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Optimistically resolving since we've moved to prometheus-based alert... 
[13:50:54] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] Add a new deployment target in the beta cluster [dumps/scap] - 10https://gerrit.wikimedia.org/r/1008451 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [13:51:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: re-image [13:51:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: re-image [13:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [13:51:46] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9595953 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=63abb5d8-03a7-48ae-abcc-214900c13c28) set by akosiaris@cumin1002 for 2:00:0... [13:54:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58344 and previous config saved to /var/cache/conftool/dbconfig/20240304-135446-arnaudb.json [13:54:51] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:56:04] (03PS2) 10Alexandros Kosiaris: Move 8 eqiad parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:02:51] (03PS3) 10Ssingh: dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) [14:04:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1568/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:04:40] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9596013 (10ABran-WMF) [14:04:53] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9596015 (10bking) > @bking what if we release spicerack with the change... [14:05:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] "duh, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1008452 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [14:09:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58345 and previous config saved to /var/cache/conftool/dbconfig/20240304-140952-arnaudb.json [14:11:47] PROBLEM - MariaDB Replica SQL: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table nlwiki.recentchanges: Index for table recentchanges is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1155-bin.001893, end_log_pos 431898912 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depoolin [14:11:47] ca [14:12:30] Any gerrit admin around? Could you please add me to `Trusted-Contributors` (2021f25e7515187a81d51f8fe14dd6f25617cce0) ? 
[14:12:52] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1010.eqiad.wmnet with OS bullseye [14:13:06] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye [14:13:30] (03CR) 10Muehlenhoff: LDAPBackend: Implement limit checks for UID (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [14:16:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [14:16:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2158.codfw.wmnet with reason: Silence for maintenance T356240 [14:17:04] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9596058 (10MoritzMuehlenhoff) [14:17:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2158.codfw.wmnet with reason: Silence for maintenance T356240 [14:17:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58346 and previous config saved to /var/cache/conftool/dbconfig/20240304-141730-arnaudb.json [14:17:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2158.codfw.wmnet [14:18:39] 06SRE, 10ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9596066 (10Volans) @wiki_willy yes, if we go that way then I guess a separate tab on the accounting sheet with both asset tags (chassis and motherboard), compiled only for the hosts that have had th... 
[14:19:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'For maint', diff saved to https://phabricator.wikimedia.org/P58347 and previous config saved to /var/cache/conftool/dbconfig/20240304-141913-ladsgroup.json [14:19:35] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2156.codfw.wmnet onto db2194.codfw.wmnet [14:20:01] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:21:49] !og installing glib2.0 security updates [14:22:06] (03CR) 10Majavah: LDAPBackend: Implement limit checks for UID (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [14:22:42] (03CR) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [14:22:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2158.codfw.wmnet [14:22:58] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596083 (10Jhancock.wm) @ABran-WMF I'll be here for that. 
[14:23:40] (03CR) 10Muehlenhoff: LDAPBackend: Implement limit checks for UID (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [14:24:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58348 and previous config saved to /var/cache/conftool/dbconfig/20240304-142459-arnaudb.json [14:25:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [14:27:30] (03CR) 10Btullis: elastic: add elastic2088-2109 to production role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [14:27:36] !log disable puppet on A:lvs to merge CR 1007879 [14:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [14:28:07] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596105 (10ABran-WMF) I'll depool the node around 15:55 UTC then and will wait for your confirmation to repool it [14:28:52] !log reprepro -C component/pybal include bullseye-wikimedia pybal_1.15.14_amd64.changes [14:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:12] (03CR) 10Ssingh: [C: 03+2] Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh) [14:29:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58349 and previous config saved to /var/cache/conftool/dbconfig/20240304-142921-arnaudb.json [14:29:32] (03PS2) 10Ssingh: Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 [14:29:52] (03CR) 
10Ssingh: Revert "pybal: do not install from component" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh) [14:30:04] (03CR) 10AOkoth: [C: 03+1] [vrts] Remove ticket-test.wm.o and vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/1008445 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney) [14:30:19] (03CR) 10Btullis: [C: 03+2] Failback hive services to an-coord1003 after restart [dns] - 10https://gerrit.wikimedia.org/r/1008415 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [14:30:20] !log manually update PCC facts from puppetserver1001 to pick up cloudnet2007/8-dev os upgrade [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:51] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 (owner: 10Ssingh) [14:30:56] (03Abandoned) 10Reedy: captchaloop: Generate old and new captchas [puppet] - 10https://gerrit.wikimedia.org/r/990715 (owner: 10Reedy) [14:30:58] RECOVERY - MariaDB Replica SQL: s2 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:32:08] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:33:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596119 (10KCVelaga_WMF) @MoritzMuehlenhoff When I change my email to wikimedia.org for the developer account, I am encountering a... 
[14:34:26] (03PS4) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) [14:35:45] (03CR) 10AOkoth: [C: 03+1] [vrts] Remove ticket-test.wm.o and vrts1002 [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney) [14:36:31] (03PS1) 10Giuseppe Lavagetto: multiversion-base: rebuild to include new php-luasandbox [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008457 (https://phabricator.wikimedia.org/T358867) [14:36:46] (03CR) 10Bking: elastic: add elastic2088-2109 to production role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [14:37:01] (03PS5) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) [14:37:06] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] multiversion-base: rebuild to include new php-luasandbox [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008457 (https://phabricator.wikimedia.org/T358867) (owner: 10Giuseppe Lavagetto) [14:38:03] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596140 (10MoritzMuehlenhoff) @KCVelaga_WMF : That is expected, your kcvelaga account isn't yet part of the cn=wmf LDAP group, it... 
[14:38:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [14:40:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T357189)', diff saved to https://phabricator.wikimedia.org/P58350 and previous config saved to /var/cache/conftool/dbconfig/20240304-144005-arnaudb.json [14:40:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [14:40:10] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:40:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596151 (10KCVelaga_WMF) @MoritzMuehlenhoff Ah okay! Thanks for clarifying. Also, to answer your second question, all of my work i... [14:40:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [14:43:09] !log sudo cumin -b1 -s 30 "A:lvs and not P{lvs2014*}" "run-puppet-agent --enable 'merging CR 1007879'" [14:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:43:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [14:43:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58351 and previous config saved to /var/cache/conftool/dbconfig/20240304-144344-arnaudb.json [14:44:25] (03CR) 10Cathal Mooney: [C: 03+2] Add shell user for kcvelaga, mirroring kcv-wikimf [puppet] - 10https://gerrit.wikimedia.org/r/1008450 
(https://phabricator.wikimedia.org/T358658) (owner: 10Cathal Mooney) [14:44:26] (SystemdUnitFailed) firing: (2) ferm.service on mw1453:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58352 and previous config saved to /var/cache/conftool/dbconfig/20240304-144426-arnaudb.json [14:45:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1010.eqiad.wmnet with OS bullseye [14:45:31] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1010.eqiad.wmnet with OS bullseye comp... 
[14:45:48] (03CR) 10Muehlenhoff: [C: 03+2] Pass firewall range in profile::firewall syntax for remaining Airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1008406 (owner: 10Muehlenhoff) [14:46:14] (03PS2) 10Muehlenhoff: airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407 [14:47:24] (03CR) 10CI reject: [V: 04-1] airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff) [14:48:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58353 and previous config saved to /var/cache/conftool/dbconfig/20240304-144844-arnaudb.json [14:48:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:50:18] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on stat1005.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010 [14:50:32] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on stat1005.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010 [14:50:40] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on stat1010.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010 [14:50:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on stat1010.eqiad.wmnet with reason: Moving GPU from stat1005 to stat1010 [14:53:30] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9596240 (10BTullis) The two servers have been shut down and are ready for the hardware swap. 
[14:53:51] (03PS3) 10Muehlenhoff: airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407 [14:53:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9596237 (10cmooney) >>! In T358658#9596140, @MoritzMuehlenhoff wrote: > @KCVelaga_WMF : That is expected, your kcvelaga account is... [14:54:16] PROBLEM - Check whether ferm is active by checking the default input chain on mw1453 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:54:31] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [14:58:03] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:19] (03PS1) 10Majavah: openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) [14:58:24] (03PS1) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373) [14:59:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58354 and previous config saved to 
/var/cache/conftool/dbconfig/20240304-145931-arnaudb.json [15:00:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff) [15:00:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance [15:00:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance [15:01:46] (03PS2) 10Majavah: openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) [15:01:51] (03PS2) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373) [15:03:32] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1570/co" [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [15:03:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58356 and previous config saved to /var/cache/conftool/dbconfig/20240304-150350-arnaudb.json [15:04:20] <_joe_> !log installing php-luasandbox update on mediawiki canaries T353414 [15:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:23] T353414: Build and deploy LuaSandbox 4.1.2 - https://phabricator.wikimedia.org/T353414 [15:09:15] (03PS1) 10Bking: flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) [15:09:26] (SystemdUnitFailed) firing: (2) ferm.service on mw1453:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:17] (03CR) 10Stevemunene: "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [15:13:22] (03PS3) 10Majavah: openstack: neutron: first attempt of installing ovs-agent [puppet] - 10https://gerrit.wikimedia.org/r/1008463 (https://phabricator.wikimedia.org/T326373) [15:13:31] (03CR) 10DCausse: [C: 03+1] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking) [15:13:35] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:13:41] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:14:29] (03PS3) 10Eevans: restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560) [15:14:34] (03PS3) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) [15:14:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58357 and previous config saved to /var/cache/conftool/dbconfig/20240304-151436-arnaudb.json [15:14:39] (03PS3) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) [15:14:47] (03PS3) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) [15:14:55] (03PS3) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 
10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [15:15:03] (03PS3) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [15:15:36] (03CR) 10Bking: [C: 03+2] elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [15:17:40] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [15:17:56] (03CR) 10Brouberol: elastic: add elastic2088-2109 to production role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [15:18:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58358 and previous config saved to /var/cache/conftool/dbconfig/20240304-151856-arnaudb.json [15:19:20] (03PS1) 10Effie Mouzeli: mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487 [15:20:30] (03CR) 10Majavah: [C: 03+2] P:openstack: rabbitmq: remove cinder-backups term [puppet] - 10https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: 10Majavah) [15:20:53] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487 (owner: 10Effie Mouzeli) [15:21:05] (03PS9) 10Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998419 [15:22:31] (03Merged) 10jenkins-bot: mw-mcrouter: adjust resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008487 (owner: 10Effie Mouzeli) [15:22:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1573/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah) [15:23:30] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:23:38] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:24:09] (03CR) 10Dzahn: "the removal from site.pp needs to happen after the decom cookbook finished. but at the same time it will warn you about remaining strings " [puppet] - 10https://gerrit.wikimedia.org/r/1008447 (https://phabricator.wikimedia.org/T359041) (owner: 10EoghanGaffney) [15:24:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw1453 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:25:34] (03Abandoned) 10Dzahn: site: remove etherpad on bullseye machine [puppet] - 10https://gerrit.wikimedia.org/r/1003075 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [15:26:05] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: rabbitmq: use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah) [15:27:37] (03CR) 10Brouberol: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff) [15:29:12] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1011.eqiad.wmnet with OS bullseye [15:29:26] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye [15:30:36] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1037.eqiad.wmnet with reason: Bootstrapping — T354560 [15:30:43] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [15:30:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1037.eqiad.wmnet with reason: Bootstrapping — T354560 [15:31:24] (03PS1) 10Effie Mouzeli: mw-mcrouter: adjust resources (cpu) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 [15:32:15] 06SRE, 06Infrastructure-Foundations, 06serviceops, 07ARM support: Adoption of aarch64 (aka arm64) in WMF production? 
(SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811#9596480 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:34:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58359 and previous config saved to /var/cache/conftool/dbconfig/20240304-153403-arnaudb.json [15:34:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [15:34:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:34:18] (03PS2) 10Effie Mouzeli: mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 [15:34:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [15:34:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58360 and previous config saved to /var/cache/conftool/dbconfig/20240304-153425-arnaudb.json [15:35:57] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 (owner: 10Effie Mouzeli) [15:37:02] (03Merged) 10jenkins-bot: mw-mcrouter: adjust resources 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008489 (owner: 10Effie Mouzeli) [15:38:11] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:38:18] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:39:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58361 and previous config saved to /var/cache/conftool/dbconfig/20240304-153933-arnaudb.json [15:39:37] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:40:21] (03CR) 10Clément 
Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [15:40:44] jouncebot: nowandnext [15:40:44] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [15:40:44] In 0 hour(s) and 49 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1630) [15:40:54] a'ight *cracks knuckles* [15:41:16] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [15:42:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [15:43:05] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:43:10] (03CR) 10Bking: [C: 03+2] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking) [15:43:26] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:43:34] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:43:48] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:43:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2132.codfw.wmnet with reason: Silence for maintenance [15:43:56] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:44:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2132.codfw.wmnet with reason: Silence for maintenance [15:44:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:44:21] !log cgoubert@deploy2002 
helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:44:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [15:44:52] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596555 (10ABran-WMF) server downtimed [15:45:28] (03CR) 10Bking: [V: 03+2 C: 03+2] flink-kubernetes-operator: change flink download URL [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1008486 (https://phabricator.wikimedia.org/T358879) (owner: 10Bking) [15:46:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:46:45] (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:47:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:49:04] ^ bunch of parse and elastic failures [15:51:45] (WidespreadPuppetFailure) resolved: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:52:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (4) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:52:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (12) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:52:50] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596598 (10Jhancock.wm) it's been swapped. [15:52:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2156.codfw.wmnet onto db2194.codfw.wmnet [15:53:09] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 55% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1006527 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [15:54:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58362 and previous config saved to /var/cache/conftool/dbconfig/20240304-155439-arnaudb.json [15:56:14] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse200[1-5].codfw.wmnet [15:56:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2124.codfw.wmnet with reason: Silence for maintenance T356240 [15:57:05] !log depool parse200[1-5] from parsoid from re-imaging. 
T358752 [15:57:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2124.codfw.wmnet with reason: Silence for maintenance T356240 [15:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:08] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 [15:57:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:57:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58363 and previous config saved to /var/cache/conftool/dbconfig/20240304-155742-arnaudb.json [15:57:53] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2124.codfw.wmnet [15:58:07] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=parsoid-php,dc=codfw,name=parse200[1-5].codfw.wmnet [15:58:34] !log repool parse200[1-5] in parsoid. There are 2 canaries in that set, I 'll leave them for last. T358752. [15:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:03] !log depool parse2016-parse2020 from parsoid from re-imaging. T358752 [15:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:03] (03PS1) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [16:01:17] 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070 (10fgiunchedi) [16:01:56] 10ops-codfw: Spare SSDs for titan2001 ? 
- https://phabricator.wikimedia.org/T359070#9596690 (10fgiunchedi) [16:02:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1011.eqiad.wmnet with OS bullseye [16:02:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (4) Elasticsearch instance elastic2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:03:13] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1011.eqiad.wmnet with OS bullseye comp... [16:03:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Lacks a why and a what in the commit message." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [16:05:16] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1020.eqiad.wmnet with OS bullseye [16:05:30] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9596718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1020.eqiad.wmnet with OS bullseye [16:07:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:07:52] 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9596740 (10Jhancock.wm) I have this on hand: - 8 x 300 GB SSD - 3 x 600 GB SSD - 3 x 800 GB SSD - 1 x 1.6 TB SSD Let me know which set you would like to go with. 
[16:08:22] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) [16:08:27] (03CR) 10BBlack: [C: 03+1] dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:09:34] (03PS2) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) [16:09:45] 06SRE, 10ops-codfw, 06DBA, 10decommission-hardware: decommission db2118.codfw.wmnet - https://phabricator.wikimedia.org/T358740#9596752 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:09:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58365 and previous config saved to /var/cache/conftool/dbconfig/20240304-160945-arnaudb.json [16:12:08] !log sudo cumin "A:dns-rec" "disable-puppet 'merging CR 1007918'": T347054 [16:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:11] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [16:14:57] (03PS1) 10Ladsgroup: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) [16:15:05] (03PS1) 10Alexandros Kosiaris: Move 5 codfw parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752) [16:15:39] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse201[6-9].codfw.wmnet [16:15:46] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: service=parsoid-php,dc=codfw,name=parse2020.codfw.wmnet [16:16:24] (CirrusSearchJVMGCYoungPoolInsufficient) 
firing: (2) Elasticsearch instance elastic2093-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:16:57] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:17:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (14) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:18:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58366 and previous config saved to /var/cache/conftool/dbconfig/20240304-161806-arnaudb.json [16:18:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage [16:19:00] (03PS3) 10BCornwall: slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) [16:19:13] (03CR) 10BCornwall: [V: 03+2 C: 03+2] slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [16:19:22] (03PS5) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) [16:19:28] (03CR) 10BCornwall: [V: 03+2 C: 03+2] 
slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [16:19:32] 10SRE-swift-storage: 2024-2025 ms swift capacity - https://phabricator.wikimedia.org/T359077 (10MatthewVernon) [16:20:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage [16:20:48] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9596886 (10cmooney) [16:21:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=ntp [16:21:24] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (3) Elasticsearch instance elastic2091-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:21:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=ntp [16:22:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (15) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:22:46] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [16:23:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: 
name=dns6001.wikimedia.org,service=authdns-update [16:24:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update [16:24:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T357189)', diff saved to https://phabricator.wikimedia.org/P58367 and previous config saved to /var/cache/conftool/dbconfig/20240304-162452-arnaudb.json [16:24:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [16:24:56] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:25:02] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [16:25:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [16:25:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58368 and previous config saved to /var/cache/conftool/dbconfig/20240304-162514-arnaudb.json [16:25:18] (03PS1) 10Majavah: conntrackd: fix CLI installation [puppet] - 10https://gerrit.wikimedia.org/r/1008506 [16:26:03] 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9596941 (10fgiunchedi) Thank you @Jhancock.wm ! I'd like to go for the 1x 1.6TB SSD please to be added to the existing SSDs in titan2001 [16:26:27] (03PS9) 10Arturo Borrero Gonzalez: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [16:27:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah) [16:27:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (14) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:27:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58369 and previous config saved to /var/cache/conftool/dbconfig/20240304-162755-arnaudb.json [16:28:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2171.codfw.wmnet with reason: Silence for maintenance T356240 [16:28:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2171.codfw.wmnet with reason: Silence for maintenance T356240 [16:28:34] (03CR) 10Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/1008506/1574/cloudgw1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/1008506 (owner: 10Majavah) [16:28:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2171.codfw.wmnet [16:29:21] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=ntp [16:29:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=ntp [16:30:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58370 and previous config saved to /var/cache/conftool/dbconfig/20240304-163002-arnaudb.json [16:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1630). 
[16:30:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:30:22] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:31:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008501 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:32:20] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9596970 (10Gehel) [16:32:27] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9596996 (10ABran-WMF) thanks, I've preventively reloaded haproxy. Everything should be OK [16:32:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (13) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:32:47] (03CR) 10JHathaway: [C: 03+1] Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [16:33:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2171.codfw.wmnet [16:33:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58371 and previous config saved to /var/cache/conftool/dbconfig/20240304-163311-arnaudb.json [16:33:16] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=authdns-update [16:33:42] 
!log running dummy authdns-update [16:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58372 and previous config saved to /var/cache/conftool/dbconfig/20240304-163358-arnaudb.json [16:34:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=authdns-update [16:34:40] !log running dummy authdns-update [16:34:41] (ConfdResourceFailed) firing: confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:46] hmm ok [16:35:26] (03PS2) 10Dbrant: Move account vanishing contact form to Meta wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) [16:37:31] (03CR) 10Arturo Borrero Gonzalez: "hey, what do you think about this approach? with this scheme, I think we can open holes only for the hosts that need them open." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [16:37:38] (03CR) 10Muehlenhoff: [C: 03+2] airflow: Remove ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/1008407 (owner: 10Muehlenhoff) [16:37:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (12) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:39:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1020.eqiad.wmnet with OS bullseye [16:39:17] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597019 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1020.eqiad.wmnet with OS bullseye comp... 
[16:39:48] (03PS1) 10Ssingh: hiera: dnsbox: update service_type names [puppet] - 10https://gerrit.wikimedia.org/r/1008510 [16:40:53] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1021.eqiad.wmnet with OS bullseye [16:41:07] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1021.eqiad.wmnet with OS bullseye [16:41:55] (03CR) 10Clément Goubert: [C: 03+1] Move 5 codfw parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [16:45:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58373 and previous config saved to /var/cache/conftool/dbconfig/20240304-164508-arnaudb.json [16:45:48] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:46:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move 5 codfw parsoid servers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1008504 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [16:46:23] (03CR) 10Ssingh: [C: 03+2] hiera: dnsbox: update service_type names [puppet] - 10https://gerrit.wikimedia.org/r/1008510 (owner: 10Ssingh) [16:47:20] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir4001.ulsfo.wmnet [16:47:34] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:47:36] PROBLEM - HTTPS 
non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:47:36] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:47:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:47:42] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:47:42] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:47:58] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:48:10] That's me.... 
[16:48:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58374 and previous config saved to /var/cache/conftool/dbconfig/20240304-164816-arnaudb.json [16:48:42] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 277818 seconds left:Certificate *.wikipedia.bg valid until 2024-04-13 06:06:54 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:48:42] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 332897 seconds left:Certificate *.wikimania.com valid until 2024-05-25 10:21:04 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:48:58] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 322081 seconds left:Certificate *.wikispecies.net valid until 2024-05-25 08:20:38 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:49:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58375 and previous config saved to /var/cache/conftool/dbconfig/20240304-164903-arnaudb.json [16:49:19] (03PS1) 10Muehlenhoff: puppetboard: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1008513 [16:49:34] Hey SRE, just FYI, I'm doing a scap sync and I got a "Host key verification failed" for parse1021.eqiad.wmnet [16:49:53] jan_drewniak: probably due to some reinstalls... 
CC akosiaris ^^ [16:51:35] jan_drewniak: gimme a sec [16:51:47] !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: service=parsoid-php,dc=codfw,name=parse2020.codfw.wmnet [16:52:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=authdns-ns2 [16:52:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-ns2 [16:53:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage [16:53:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=authdns-ns2 [16:53:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-ns2 [16:53:57] jan_drewniak: it should be ok now, you just fell in the time window between the host being in the dsh scap list and it being removed. [16:54:15] sorry about that, I should have set the host as inactive, not depooled. [16:54:26] (SystemdUnitFailed) firing: (2) ferm.service on mw2384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:37] akosiaris: np, should I restart the scap sync or let it keep running? [16:54:41] (ConfdResourceFailed) resolved: confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:54:56] jan_drewniak: you can let it keep running, the host is no more a mediawiki host. 
[16:55:31] (03CR) 10JHathaway: [C: 03+1] puppetboard: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1008513 (owner: 10Muehlenhoff) [16:55:43] gotcha [16:55:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:56:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage [16:56:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org,service=authdns-ns1 [16:57:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:57:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org,service=authdns-ns1 [16:59:21] !log sudo cumin -b1 -s120 "A:dns-rec" "run-puppet-agent --enable 'merging CR 1007918'": finish rolling out confd state management: T347054 [16:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:25] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [17:00:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58376 and previous config saved to /var/cache/conftool/dbconfig/20240304-170015-arnaudb.json [17:03:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: 
Maintenance done', diff saved to https://phabricator.wikimedia.org/P58377 and previous config saved to /var/cache/conftool/dbconfig/20240304-170320-arnaudb.json [17:04:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58378 and previous config saved to /var/cache/conftool/dbconfig/20240304-170408-arnaudb.json [17:05:22] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006516 (https://phabricator.wikimedia.org/T358483) (owner: 10Majavah) [17:07:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (7) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:08:02] 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#9597218 (10RLazarus) [17:09:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [17:10:51] jouncebot: nowandnext [17:10:52] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [17:10:52] In 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800) [17:10:52] In 0 hour(s) and 49 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800) [17:10:59] (03CR) 10Jforrester: [C: 03+2] ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1007885 (owner: 10Jforrester) [17:11:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [17:11:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [17:11:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1007885 (owner: 10Jforrester) [17:12:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (7) Elasticsearch instance elastic2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:13:40] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086 (10phaultfinder) [17:14:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1021.eqiad.wmnet with OS bullseye [17:14:45] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1021.eqiad.wmnet with OS bullseye comp... 
[17:15:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T357189)', diff saved to https://phabricator.wikimedia.org/P58379 and previous config saved to /var/cache/conftool/dbconfig/20240304-171521-arnaudb.json
[17:15:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:15:30] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:15:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[17:15:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58380 and previous config saved to /var/cache/conftool/dbconfig/20240304-171543-arnaudb.json
[17:15:52] (PS3) Jforrester: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster)
[17:15:59] (Merged) jenkins-bot: ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007885 (owner: Jforrester)
[17:16:16] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1022.eqiad.wmnet with OS bullseye
[17:16:32] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597325 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1022.eqiad.wmnet with OS bullseye
[17:16:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw2384 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:16:46] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 244394 seconds left:Certificate wikimedia.is valid until 2024-04-11 10:06:15 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:16:46] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 262273 seconds left:Certificate wikipedia.com valid until 2024-04-05 02:10:51 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:16:46] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 433273 seconds left:Certificate wikipedia.fi valid until 2024-05-03 08:30:14 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:17:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:17:50] (03CR) 10Jforrester: [C: 03+1] "Looks good to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: 10Houseblaster) [17:17:54] 06SRE, 10observability: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#9597339 (10Krinkle) [17:18:02] 06SRE, 10observability, 07Grafana: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#9597341 (10Krinkle) [17:18:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:18:26] 06SRE, 10ops-codfw: Spare SSDs for titan2001 ? - https://phabricator.wikimedia.org/T359070#9597348 (10Jhancock.wm) It's been inserted. Lemme know if you need anything else! [17:18:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9597351 (10Marostegui) [17:18:45] jan_drewniak: Scap still running? [17:18:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58381 and previous config saved to /var/cache/conftool/dbconfig/20240304-171913-arnaudb.json [17:20:16] James_F: hey, yeah, still running. It looks like it's produced a few errors but only because of the parse2019.codfw.wmnet depooling [17:20:25] Ack. [17:20:56] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1008501| Bumping portals to master (T128546)]] (duration: 45m 54s) [17:20:59] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:21:01] Aha. 
[17:21:06] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]] [17:21:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58382 and previous config saved to /var/cache/conftool/dbconfig/20240304-172136-arnaudb.json [17:21:40] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:21:57] James_F: K looks like it's done [17:22:06] Yeah, perfect. :-) [17:22:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:24:26] (SystemdUnitFailed) firing: (2) ferm.service on mw2384:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:39] (CirrusSearchNodeIndexingNotIncreasing) resolved: (4) Elasticsearch instance elastic2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:29:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage [17:30:35] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9597423 (10Jclark-ctr) Removed gpu from stat1005 found power plug has 
changed between 730xd to 740xd. both servers powered on with no gpu. opened ticket requesting cable ordered T359089 [17:31:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage [17:34:26] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:34:34] !log jforrester@deploy2002 jforrester: Continuing with sync [17:36:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1012.eqiad.wmnet with OS bullseye [17:36:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58383 and previous config saved to /var/cache/conftool/dbconfig/20240304-173642-arnaudb.json [17:36:51] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye [17:46:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw2384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:46:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58384 and previous config saved to /var/cache/conftool/dbconfig/20240304-174653-marostegui.json [17:46:57] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [17:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [17:49:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1022.eqiad.wmnet with OS bullseye [17:49:27] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1022.eqiad.wmnet with OS bullseye comp... [17:51:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [17:51:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P58385 and previous config saved to /var/cache/conftool/dbconfig/20240304-175148-arnaudb.json [17:52:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1023.eqiad.wmnet with OS bullseye [17:52:55] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1023.eqiad.wmnet with OS bullseye [17:59:51] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:1007885|ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights]] (duration: 38m 44s) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800) [18:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T1800). [18:01:40] 06SRE, 10SRE-Access-Requests: Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092 (10TJones) [18:02:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58386 and previous config saved to /var/cache/conftool/dbconfig/20240304-180159-marostegui.json [18:06:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T357189)', diff saved to https://phabricator.wikimedia.org/P58387 and previous config saved to /var/cache/conftool/dbconfig/20240304-180655-arnaudb.json [18:06:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [18:07:03] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:07:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [18:07:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58388 and previous config saved to /var/cache/conftool/dbconfig/20240304-180717-arnaudb.json [18:08:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1023.eqiad.wmnet with reason: host reimage [18:09:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1012.eqiad.wmnet with OS bullseye [18:09:31] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1012.eqiad.wmnet with OS bullseye comp... 
[18:12:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58389 and previous config saved to /var/cache/conftool/dbconfig/20240304-181219-arnaudb.json
[18:12:24] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:16:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED
[18:17:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P58390 and previous config saved to /var/cache/conftool/dbconfig/20240304-181705-marostegui.json
[18:23:17] SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597617 (VRiley-WMF) a:VRiley-WMF
[18:24:11] SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597619 (VRiley-WMF) Reseated the power supply cable. Monitored issue and the error has been resolved.
[18:24:17] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9597620 (Jhancock.wm)
[18:24:28] SRE, ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T359086#9597621 (VRiley-WMF) Open→Resolved
[18:24:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035']
[18:26:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035']
[18:26:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1023.eqiad.wmnet with OS bullseye
[18:26:24] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1023.eqiad.wmnet with OS bullseye comp...
[18:26:48] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1024.eqiad.wmnet with OS bullseye [18:26:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035'] [18:27:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035'] [18:27:01] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1024.eqiad.wmnet with OS bullseye [18:27:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58391 and previous config saved to /var/cache/conftool/dbconfig/20240304-182726-arnaudb.json [18:27:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [18:27:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [18:29:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2036'] [18:29:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2036'] [18:32:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T354015)', diff saved to https://phabricator.wikimedia.org/P58392 and previous config saved to /var/cache/conftool/dbconfig/20240304-183212-marostegui.json [18:32:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [18:32:18] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - 
https://phabricator.wikimedia.org/T354015 [18:32:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [18:40:34] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1024.eqiad.wmnet with OS bullseye [18:40:49] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9597703 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1024.eqiad.wmnet with OS bullseye exec... [18:42:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P58393 and previous config saved to /var/cache/conftool/dbconfig/20240304-184234-arnaudb.json [18:50:29] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [18:50:31] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [18:50:31] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [18:50:37] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [18:50:37] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [18:50:51] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://wikitech.wikimedia.org/wiki/Ncredir [18:57:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T357189)', diff saved to https://phabricator.wikimedia.org/P58394 and previous config saved to /var/cache/conftool/dbconfig/20240304-185740-arnaudb.json [18:57:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [18:57:44] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:57:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [18:59:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:09] !log htriedman@deploy2002 Started deploy [airflow-dags/analytics_product@a076d5c]: (no justification provided) [19:00:18] !log htriedman@deploy2002 Finished deploy [airflow-dags/analytics_product@a076d5c]: (no justification provided) (duration: 00m 09s) [19:00:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:00:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:03:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:03:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:06:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [19:06:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [19:10:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:10:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:10:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58395 and previous config saved to /var/cache/conftool/dbconfig/20240304-191028-arnaudb.json [19:10:32] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:16:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58396 and previous config saved to /var/cache/conftool/dbconfig/20240304-191601-arnaudb.json [19:16:06] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:31:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P58398 and previous config saved to /var/cache/conftool/dbconfig/20240304-193108-arnaudb.json [19:33:30] 06SRE, 10SRE-Access-Requests: Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9597927 (10Gehel) Approved as Trey's manager. 
[19:33:49] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.04 - 2024.03.24): Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9597928 (10Gehel) [19:46:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P58399 and previous config saved to /var/cache/conftool/dbconfig/20240304-194614-arnaudb.json [19:56:36] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2104.codfw.wmnet with OS bullseye [19:56:48] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@a076d5c]: (no justification provided) [19:57:14] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@a076d5c]: (no justification provided) (duration: 00m 26s) [19:58:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2105.codfw.wmnet with OS bullseye [20:01:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T357189)', diff saved to https://phabricator.wikimedia.org/P58400 and previous config saved to /var/cache/conftool/dbconfig/20240304-200121-arnaudb.json [20:01:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [20:01:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [20:01:39] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:01:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58401 and previous config saved to /var/cache/conftool/dbconfig/20240304-200143-arnaudb.json [20:02:47] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 423313 seconds left:Certificate wikipedia.fi valid until 2024-05-03 08:30:14 +0000 
(expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:02:49] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 234431 seconds left:Certificate wikimedia.is valid until 2024-04-11 10:06:15 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:02:49] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 252311 seconds left:Certificate wikipedia.com valid until 2024-04-05 02:10:51 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:02:51] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 266168 seconds left:Certificate *.wikipedia.bg valid until 2024-04-13 06:06:54 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:02:51] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 321248 seconds left:Certificate *.wikimania.com valid until 2024-05-25 10:21:04 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:03:13] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 310426 seconds left:Certificate *.wikispecies.net valid until 2024-05-25 08:20:38 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [20:08:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:08:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:49] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage [20:14:44] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage [20:15:47] !log bking@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2104.codfw.wmnet with reason: host reimage [20:16:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:18:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2105.codfw.wmnet with reason: host reimage [20:19:34] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1008535 [20:25:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:25:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:28:09] (03PS2) 10Herron: profile::kafka::broker: set cert renewal at 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) [20:28:56] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:29:03] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:31:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:31:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:33:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:33:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:33:38] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5025.eqsin.wmnet [20:34:21] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905 [20:34:33] T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905 [20:34:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905 [20:37:39] !log 
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:38:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2104.codfw.wmnet with OS bullseye [20:41:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2105.codfw.wmnet with OS bullseye [20:41:09] (03PS1) 10Dzahn: ci_test: add profile::ci::website to allow deployments [puppet] - 10https://gerrit.wikimedia.org/r/1008539 [20:41:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:41:52] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:46:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:46:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:47:44] (03PS3) 10Bking: elastic: move elastic2107 and 2108 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) [20:49:03] (03PS2) 10Dzahn: ci_test: add profile::ci::website to allow deployments [puppet] - 10https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237) [20:50:02] (03CR) 10Dzahn: [C: 03+1] elastic: move elastic2107 and 2108 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:50:20] (03CR) 10Bking: [C: 03+2] elastic: move elastic2107 and 2108 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1008528 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:50:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:50:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:52:50] (03CR) 10Dzahn: [C: 
03+2] ci_test: add profile::ci::website to allow deployments [puppet] - 10https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [20:52:52] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:52:59] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:53:07] (03PS10) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [20:54:17] (03PS1) 10Bartosz Dziewoński: HandleSectionLinks: Fix handling headings with raw `>` in attributes [core] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1008472 (https://phabricator.wikimedia.org/T358810) [20:55:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on elastic2107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:55:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:58:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2107.codfw.wmnet with OS bullseye [20:58:28] (03CR) 10Dzahn: [C: 03+2] "Info: Applying configuration version '(a39fe81517) Dzahn - ci_test: add profile::ci::website to 
allow deployments'" [puppet] - 10https://gerrit.wikimedia.org/r/1008539 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [20:58:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2108.codfw.wmnet with OS bullseye [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T2100). nyaa~ [21:00:04] houseblaster, dbrant, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:33] o/ [21:00:37] o/ [21:00:51] hi! [21:00:52] hi [21:02:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58403 and previous config saved to /var/cache/conftool/dbconfig/20240304-210214-arnaudb.json [21:02:21] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:05:44] (03PS1) 10Dzahn: ci_test: test switching firewall provider back to iptables [puppet] - 10https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237) [21:08:34] hi - i can deploy unless someone is already at it? 
[21:09:17] cjming: all yours [21:09:32] cool - i'll go in order of the queue [21:10:15] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster) [21:11:02] (CR) Dzahn: [C: +2] ci_test: test switching firewall provider back to iptables [puppet] - https://gerrit.wikimedia.org/r/1008545 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn) [21:11:45] (Merged) jenkins-bot: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: Houseblaster) [21:12:03] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] [21:12:12] T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462 [21:14:15] !log bking@cumin2002 depool wdqs2007 for T355873 [21:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:18] T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 [21:17:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58404 and previous config saved to /var/cache/conftool/dbconfig/20240304-211721-arnaudb.json [21:20:23] if any SREs are around, my terminal seems to be choking on deploying to test servers - should be quick with a simple config change - any suggestions?
[21:24:20] !log cjming@deploy2002 cjming and houseblaster: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:24:26] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:29] nvm - finally went thru [21:24:33] T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462 [21:24:39] houseblaster: up on test servers if you want to check [21:25:18] My change is working :) [21:25:24] cool - syncing [21:25:26] !log cjming@deploy2002 cjming and houseblaster: Continuing with sync [21:28:56] in case anyone is around, something does seem off/pokey -- syncing also appears to be getting stuck [21:29:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:29:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:32:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P58405 and previous config saved to /var/cache/conftool/dbconfig/20240304-213228-arnaudb.json [21:33:28] looking... [21:43:19] sorry everyone - something is not right -- deployments are choking -- i've been advised to file a ticket and abort the backport window until it's resolved [21:43:39] houseblaster: your patch is not fully deployed [21:46:21] hmm, thanks for letting us know. i'll schedule my patch for tomorrow then, it probably wouldn't have made it in time anyway [21:47:19] I do have to go in a minute. Should I reschedule for tomorrow? 
[21:47:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T357189)', diff saved to https://phabricator.wikimedia.org/P58406 and previous config saved to /var/cache/conftool/dbconfig/20240304-214734-arnaudb.json [21:47:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [21:47:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:47:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [21:47:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58407 and previous config saved to /var/cache/conftool/dbconfig/20240304-214757-arnaudb.json [21:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:41] houseblaster: i actually think your patch might have just finished syncing -- if you can check that it's live in about 5 minutes (php restarts are about 1/2 way thru) -- if it's not, then please reschedule your patch [21:49:43] Can do. [21:49:47] cjming: Based on the helm rollbacks that happened, the change might be partially deployed (fully deployed on bare metal servers, possibly rolled back on k8s pods). [21:50:38] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s) [21:50:42] T355462: Set $wgSignatureValidation to disallow [enwiki] - https://phabricator.wikimedia.org/T355462 [21:51:05] dancy: thanks - gtk - ya, my terminal just said the backport failed. 
Ticket incoming [21:51:35] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:51:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:22] houseblaster: see my reply above - please reschedule, looks like syncing failed [21:55:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:55:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:55:32] backport window is bust - closing for now [21:56:08] !log end of UTC late backport window due to deployment errors [21:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58408 and previous config saved to /var/cache/conftool/dbconfig/20240304-215626-arnaudb.json [21:56:30] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:57:25] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240304T2200). 
[22:00:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:00:24] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:01:17] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:25] (SystemdUnitFailed) resolved: (4) confd_prometheus_metrics.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:06:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:49] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9598535 (wiki_willy) Thanks for confirming, @Volans. If everyone else is ok with making the correlation on the accounting spreadsheet, my vote is that we go with that route.
Thanks, Willy [22:09:32] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:09:38] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P58409 and previous config saved to /var/cache/conftool/dbconfig/20240304-221132-arnaudb.json [22:11:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:11:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:19:14] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2107.codfw.wmnet with OS bullseye [22:19:22] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905 [22:19:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905 [22:19:29] T355905: Restarting fifo-log-demux should not restart nginx - https://phabricator.wikimedia.org/T355905 [22:19:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2108.codfw.wmnet with OS bullseye [22:25:49] (03PS2) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [22:25:54] (03PS2) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [22:25:59] (03PS1) 10Andrew Bogott: role::puppetserver::cloud_vps_project: remove firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1008554 (https://phabricator.wikimedia.org/T351450) [22:26:39] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2088-production-search-codfw is not indexing - 
https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:26:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P58410 and previous config saved to /var/cache/conftool/dbconfig/20240304-222639-arnaudb.json [22:31:39] (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2088-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:33:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:33:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:41:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T357189)', diff saved to https://phabricator.wikimedia.org/P58411 and previous config saved to /var/cache/conftool/dbconfig/20240304-224145-arnaudb.json [22:41:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [22:41:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:42:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [22:42:49] (03CR) 10Krinkle: [C: 03+2] Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: 10Krinkle) [22:43:02] !log 
jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035'] [22:43:45] (03PS2) 10Effie Mouzeli: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 [22:43:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035'] [22:43:54] (03Merged) 10jenkins-bot: Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: 10Krinkle) [22:44:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035'] [22:44:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035'] [22:45:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [22:45:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [22:45:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58412 and previous config saved to /var/cache/conftool/dbconfig/20240304-224550-arnaudb.json [22:47:28] !log deployed patch for T357760 [22:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:49:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:59:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:59:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:59:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2088-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [23:01:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [23:01:33] maryum: Did the security path deployment go smoothly? [23:01:42] *patch [23:01:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2035'] [23:02:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2035'] [23:03:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:03:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:05:12] I can verify that the security patch worked (if that is what you are asking). [23:05:58] In particular I'm wondering if the kubernetes part of the deployment ran smoothly. There were problems earlier. [23:08:26] Hmm.. I'm looking through logstash and appears that the problem persists. 
[23:11:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:11:17] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:13:44] 06SRE, 10ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9598738 (10Volans) Sounds good to me, let me know once done so that I can make the related changes to the report to include those too. [23:15:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9598740 (10odimitrijevic) Approved [23:16:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:16:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:17:03] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9598742 (10odimitrijevic) Yes, approved [23:18:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:18:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:20:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:20:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:24:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:24:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:28:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T357189)', diff saved to https://phabricator.wikimedia.org/P58413 and previous config saved to /var/cache/conftool/dbconfig/20240304-232826-arnaudb.json [23:28:30] T357189: Drop iwl_prefix_from_title from iwlinks - 
https://phabricator.wikimedia.org/T357189 [23:28:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:28:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:30:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:30:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:31:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [23:31:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2036.mgmt.codfw.wmnet with reboot policy FORCED [23:32:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [23:32:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [23:32:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2039.mgmt.codfw.wmnet with reboot policy FORCED [23:32:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [23:32:09] !log dancy@deploy2002 Installing scap version "4.68.0" for 413 hosts [23:32:50] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:32:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:35:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:35:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:37:52] !log dancy@deploy2002 Locking from deployment [mediawiki]: Mediawiki deployments locked pending resolution of T359114 [23:37:56] T359114: Slow and failed deployments - https://phabricator.wikimedia.org/T359114 [23:38:27] !log @deploy2002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [23:38:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:39:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [23:39:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [23:40:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [23:41:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [23:43:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P58414 and previous config saved to /var/cache/conftool/dbconfig/20240304-234332-arnaudb.json [23:44:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2036.mgmt.codfw.wmnet with reboot policy FORCED [23:44:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2039.mgmt.codfw.wmnet with reboot policy FORCED [23:48:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [23:48:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [23:50:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2035.codfw.wmnet with OS bookworm [23:50:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2036.codfw.wmnet with OS bookworm [23:50:14] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598783 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2035.codfw.wmnet with OS bookworm [23:50:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2037.codfw.wmnet with OS bookworm [23:50:17] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598784 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2036.codfw.wmnet with OS bookworm [23:50:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2038.codfw.wmnet with OS bookworm [23:50:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2039.codfw.wmnet with OS bookworm [23:50:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2040.codfw.wmnet with OS bookworm [23:50:23] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598785 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2037.codfw.wmnet with OS bookworm [23:50:35] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598786 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2038.codfw.wmnet with OS bookworm [23:50:47] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2039.codfw.wmnet with OS bookworm [23:50:59] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9598788 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2040.codfw.wmnet with OS
bookworm [23:52:53] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123 (bdgreenlee) [23:58:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P58415 and previous config saved to /var/cache/conftool/dbconfig/20240304-235839-arnaudb.json