[00:02:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T376905)', diff saved to https://phabricator.wikimedia.org/P69866 and previous config saved to /var/cache/conftool/dbconfig/20241015-000236-ladsgroup.json
[00:02:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[00:02:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[00:03:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T376905)', diff saved to https://phabricator.wikimedia.org/P69867 and previous config saved to /var/cache/conftool/dbconfig/20241015-000304-ladsgroup.json
[00:09:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227375 (phaultfinder)
[00:10:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T376905)', diff saved to https://phabricator.wikimedia.org/P69868 and previous config saved to /var/cache/conftool/dbconfig/20241015-001004-ladsgroup.json
[00:10:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T370903)', diff saved to https://phabricator.wikimedia.org/P69869 and previous config saved to /var/cache/conftool/dbconfig/20241015-001024-ladsgroup.json
[00:10:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:15:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1080096 (owner: TrainBranchBot)
[00:25:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P69870 and previous config saved to /var/cache/conftool/dbconfig/20241015-002511-ladsgroup.json
[00:25:32]
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P69871 and previous config saved to /var/cache/conftool/dbconfig/20241015-002531-ladsgroup.json
[00:40:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P69872 and previous config saved to /var/cache/conftool/dbconfig/20241015-004018-ladsgroup.json
[00:40:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P69873 and previous config saved to /var/cache/conftool/dbconfig/20241015-004039-ladsgroup.json
[00:52:22] (PS3) Ebomani: Updating Patch Demo plugin to return legacy/new URL as needed [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954)
[00:55:00] (CR) Ebomani: "Hello Antoine, here are the changes to the plugin to address the missing PatchDemo links issue (https://phabricator.wikimedia.org/T374954)" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: Ebomani)
[00:55:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T376905)', diff saved to https://phabricator.wikimedia.org/P69874 and previous config saved to /var/cache/conftool/dbconfig/20241015-005525-ladsgroup.json
[00:55:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[00:55:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[00:55:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T370903)', diff saved to https://phabricator.wikimedia.org/P69875 and previous config saved to
/var/cache/conftool/dbconfig/20241015-005546-ladsgroup.json
[00:55:50] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:55:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T376905)', diff saved to https://phabricator.wikimedia.org/P69876 and previous config saved to /var/cache/conftool/dbconfig/20241015-005551-ladsgroup.json
[01:02:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T376905)', diff saved to https://phabricator.wikimedia.org/P69877 and previous config saved to /var/cache/conftool/dbconfig/20241015-010242-ladsgroup.json
[01:08:13] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.27 [core] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1080100 (https://phabricator.wikimedia.org/T375658)
[01:08:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.27 [core] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1080100 (https://phabricator.wikimedia.org/T375658) (owner: TrainBranchBot)
[01:17:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P69878 and previous config saved to /var/cache/conftool/dbconfig/20241015-011749-ladsgroup.json
[01:32:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P69879 and previous config saved to /var/cache/conftool/dbconfig/20241015-013257-ladsgroup.json
[01:34:50] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.27 [core] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1080100 (https://phabricator.wikimedia.org/T375658) (owner: TrainBranchBot)
[01:48:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T376905)', diff saved to https://phabricator.wikimedia.org/P69880 and previous config saved to
/var/cache/conftool/dbconfig/20241015-014803-ladsgroup.json
[01:48:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[01:48:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[01:48:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T376905)', diff saved to https://phabricator.wikimedia.org/P69881 and previous config saved to /var/cache/conftool/dbconfig/20241015-014831-ladsgroup.json
[01:54:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227464 (phaultfinder)
[01:55:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T376905)', diff saved to https://phabricator.wikimedia.org/P69882 and previous config saved to /var/cache/conftool/dbconfig/20241015-015516-ladsgroup.json
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0200)
[02:10:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P69883 and previous config saved to /var/cache/conftool/dbconfig/20241015-021023-ladsgroup.json
[02:14:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227471 (phaultfinder)
[02:25:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P69884 and previous config saved to /var/cache/conftool/dbconfig/20241015-022530-ladsgroup.json
[02:37:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T376905)', diff saved to https://phabricator.wikimedia.org/P69885 and previous config saved to /var/cache/conftool/dbconfig/20241015-024037-ladsgroup.json
[02:40:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:40:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:49:53] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227479 (phaultfinder)
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0300)
[03:02:10] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1080105 (https://phabricator.wikimedia.org/T375658)
[03:02:11] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1080105 (https://phabricator.wikimedia.org/T375658) (owner: TrainBranchBot)
[03:02:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:58] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1080105 (https://phabricator.wikimedia.org/T375658) (owner: TrainBranchBot)
[03:03:09] (CR) Tim Starling: [C:+1] Redirect all namespace-in-Wikipedia cases to
Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: Pppery)
[03:03:20] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.43.0-wmf.27 refs T375658
[03:03:24] T375658: 1.43.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T375658
[03:03:46] (CR) Tim Starling: [C:+1] Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) (owner: Pppery)
[03:09:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227488 (phaultfinder)
[03:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:20:38] (CR) Tim Starling: [C:+1] Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: Pppery)
[03:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227504 (phaultfinder)
[03:51:51] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.43.0-wmf.27 refs T375658 (duration: 48m 30s)
[03:51:54] T375658: 1.43.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T375658
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0400)
[04:00:58] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.24 (duration: 00m 56s)
[04:14:54] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227527
(phaultfinder)
[04:41:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:46:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227529 (phaultfinder)
[04:54:15] Updating MinT in production..
[04:55:21] (CR) KartikMistry: [C:+2] Update MinT to 2024-10-11-113932-production [deployment-charts] - https://gerrit.wikimedia.org/r/1079682 (https://phabricator.wikimedia.org/T368521) (owner: KartikMistry)
[04:56:20] (Merged) jenkins-bot: Update MinT to 2024-10-11-113932-production [deployment-charts] - https://gerrit.wikimedia.org/r/1079682 (https://phabricator.wikimedia.org/T368521) (owner: KartikMistry)
[05:09:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227534 (phaultfinder)
[05:10:27] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:15:41] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[05:22:57] (PS2) Giuseppe Lavagetto: idp: add entry for requestctl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782)
[05:23:17] (CR) Giuseppe Lavagetto: idp: add entry for requestctl.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe
Lavagetto)
[05:27:22] (CR) Giuseppe Lavagetto: [C:+2] idp: add entry for requestctl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto)
[05:35:35] <_joe_> !log restart tomcat on idp2004
[05:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:51] <_joe_> !log restart tomcat on idp1004
[05:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:17] (PS1) Giuseppe Lavagetto: hiddenparma: fix static routes, proxy pass directive [puppet] - https://gerrit.wikimedia.org/r/1080109 (https://phabricator.wikimedia.org/T371782)
[05:54:10] (CR) Giuseppe Lavagetto: [C:+2] hiddenparma: fix static routes, proxy pass directive [puppet] - https://gerrit.wikimedia.org/r/1080109 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0600)
[06:00:05] marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0600).
[06:05:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:11] (PS1) Kevin Bazira: ml-services: add article-country isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1080111 (https://phabricator.wikimedia.org/T371897)
[06:08:11] Amir1: arnaudb I'm still doing MinT deployment. Was on the staging for testing..
[06:08:45] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227569 (phaultfinder)
[06:10:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:16:43] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:18:46] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227570 (phaultfinder)
[06:24:55] (CR) Brouberol: ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[06:27:51] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:30:18] !log Updated MinT to 2024-10-11-113932-production
[06:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:24] (PS2) Brouberol: ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles [deployment-charts] - https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406)
[06:50:25] (PS8) Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406)
[06:50:36] (CR) Brouberol: ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1080032
(https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[06:51:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[06:51:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[06:51:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T367781)', diff saved to https://phabricator.wikimedia.org/P69887 and previous config saved to /var/cache/conftool/dbconfig/20241015-065130-arnaudb.json
[06:51:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:53:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T367781)', diff saved to https://phabricator.wikimedia.org/P69888 and previous config saved to /var/cache/conftool/dbconfig/20241015-065345-arnaudb.json
[06:54:00] (PS9) Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406)
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:03:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69889 and previous config saved to /var/cache/conftool/dbconfig/20241015-070327-arnaudb.json
[07:03:31] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[07:08:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P69890 and previous config saved to /var/cache/conftool/dbconfig/20241015-070852-arnaudb.json
[07:18:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69891 and previous config saved to /var/cache/conftool/dbconfig/20241015-071833-arnaudb.json
[07:18:37] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[07:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:50] (CR) David Caro: P:toolforge::proxy: use svc.toolforge.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1080056 (owner: Majavah)
[07:24:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P69892 and previous config saved to /var/cache/conftool/dbconfig/20241015-072359-arnaudb.json
[07:30:36] jouncebot: refresh
[07:30:36] I refreshed my knowledge about deployments.
[07:30:38] jouncebot: nowandnext
[07:30:39] For the next 0 hour(s) and 29 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T0700)
[07:30:39] In 2 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1000)
[07:33:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69893 and previous config saved to /var/cache/conftool/dbconfig/20241015-073338-arnaudb.json
[07:33:42] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[07:35:00] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[07:36:44] I am going to upgrade Gerrit, it will be unavailable for some minutes
[07:38:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit[1003,2002-2003].wikimedia.org with reason: Gerrit 3.10.2 update
[07:38:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit[1003,2002-2003].wikimedia.org with reason: Gerrit 3.10.2 update
[07:39:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T367781)', diff saved to https://phabricator.wikimedia.org/P69894 and previous config saved to /var/cache/conftool/dbconfig/20241015-073906-arnaudb.json
[07:39:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[07:39:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[07:39:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[07:39:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T367781)', diff saved to
https://phabricator.wikimedia.org/P69895 and previous config saved to /var/cache/conftool/dbconfig/20241015-073928-arnaudb.json
[07:40:13] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit2003 - T373897
[07:40:16] T373897: Upgrade to Gerrit 3.10.2 - https://phabricator.wikimedia.org/T373897
[07:40:20] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit2003 - T373897 (duration: 00m 07s)
[07:41:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T367781)', diff saved to https://phabricator.wikimedia.org/P69896 and previous config saved to /var/cache/conftool/dbconfig/20241015-074143-arnaudb.json
[07:42:04] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit2002 - T373897
[07:42:11] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit2002 - T373897 (duration: 00m 07s)
[07:44:44] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[07:46:03] !log hashar@deploy2002 Started deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit1003 - T373897
[07:46:07] T373897: Upgrade to Gerrit 3.10.2 - https://phabricator.wikimedia.org/T373897
[07:46:12] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@2f0c927]: Gerrit to 3.10.2 on gerrit1003 - T373897 (duration: 00m 09s)
[07:47:15] !log Restarted Gerrit - T373897
[07:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:47] FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:47:50] * arnaudb
restarts his CI job
[07:47:53] thanks hashar !
[07:48:34] ah yes that tends to break some of the jobs unfortunately
[07:48:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69897 and previous config saved to /var/cache/conftool/dbconfig/20241015-074843-arnaudb.json
[07:48:47] though most would sneak through
[07:48:47] T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[07:49:10] pulling dependencies from gerrit was hard :P
[07:49:39] (CR) DCausse: [C:+1] "this should be good to go, prod is running wmf.26" [puppet] - https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: Ryan Kemper)
[07:56:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P69898 and previous config saved to /var/cache/conftool/dbconfig/20241015-075650-arnaudb.json
[08:01:15] (PS2) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174)
[08:01:15] (CR) Arnaudb: "This cookbook will be intensively use as soon as its merged 😊" [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174) (owner: Arnaudb)
[08:11:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P69899 and previous config saved to /var/cache/conftool/dbconfig/20241015-081157-arnaudb.json
[08:27:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T367781)', diff saved to https://phabricator.wikimedia.org/P69900 and previous config saved to /var/cache/conftool/dbconfig/20241015-082704-arnaudb.json
[08:27:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on
db1235.eqiad.wmnet with reason: Maintenance
[08:27:08] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[08:27:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[08:27:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[08:27:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T367781)', diff saved to https://phabricator.wikimedia.org/P69901 and previous config saved to /var/cache/conftool/dbconfig/20241015-082727-arnaudb.json
[08:27:33] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:27:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance
[08:29:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T367781)', diff saved to https://phabricator.wikimedia.org/P69902 and previous config saved to /var/cache/conftool/dbconfig/20241015-082941-arnaudb.json
[08:32:47] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:33:32] (CR) Volans: [C:-1] "There are quite few things to adjust here. The current implementation is wrong in some cases and uses pseudocode in others."
[cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174) (owner: Arnaudb)
[08:34:30] (PS1) Brouberol: airflow: monitor the availability of the deployments [alerts] - https://gerrit.wikimedia.org/r/1080219 (https://phabricator.wikimedia.org/T377178)
[08:36:30] SRE, SRE-Unowned, Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10227875 (gmodena) >>! In T376014#10203183, @elukey wrote: I've been following this work from the trenches....
[08:44:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P69903 and previous config saved to /var/cache/conftool/dbconfig/20241015-084448-arnaudb.json
[08:45:16] (PS1) Giuseppe Lavagetto: Update with read-only support [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1080220
[08:45:26] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Update with read-only support [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1080220 (owner: Giuseppe Lavagetto)
[08:46:50] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: init - oblivian@cumin2002
[08:47:20] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: init - oblivian@cumin2002
[08:47:42] (PS3) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174)
[08:49:49] (PS4) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174)
[08:51:30] (CR) Fabfur: [C:+2] Renamed log fields for pipeline migration (haproxykafka) [puppet] - https://gerrit.wikimedia.org/r/1074414
(https://phabricator.wikimedia.org/T370668) (owner: Fabfur)
[08:51:32] (PS5) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174)
[08:52:00] (CR) Arnaudb: mariadb: pii cleaner cookbook (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174) (owner: Arnaudb)
[08:52:32] (PS1) Urbanecm: [Growth] beta: Lower batch size for reassignMenteesJob [mediawiki-config] - https://gerrit.wikimedia.org/r/1080223 (https://phabricator.wikimedia.org/T376124)
[08:53:18] (CR) Ilias Sarantopoulos: [C:+2] ml-services: add article-country isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1080111 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[08:54:38] (Merged) jenkins-bot: ml-services: add article-country isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1080111 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[08:58:55] SRE, Infrastructure-Foundations, netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10227934 (cmooney) Open→Resolved Closing this one, things have been ok since upgrade/reset.
[08:59:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P69905 and previous config saved to /var/cache/conftool/dbconfig/20241015-085955-arnaudb.json
[09:07:06] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:09:20] (03PS1) 10Kosta Harlan: temp accounts: Make temp accounts known on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) [09:12:42] (03PS11) 10Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T377174) [09:15:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T367781)', diff saved to https://phabricator.wikimedia.org/P69906 and previous config saved to /var/cache/conftool/dbconfig/20241015-091502-arnaudb.json [09:15:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:15:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:15:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:15:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:15:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:15:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:16:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:16:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:16:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:16:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T367781)', diff saved to https://phabricator.wikimedia.org/P69907 and previous config saved 
to /var/cache/conftool/dbconfig/20241015-091635-arnaudb.json [09:18:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T367781)', diff saved to https://phabricator.wikimedia.org/P69908 and previous config saved to /var/cache/conftool/dbconfig/20241015-091852-arnaudb.json [09:19:47] FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:22:37] (03CR) 10Elukey: [C:03+1] dragonfly::dfdaemon: Enable by default when profile is included (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:23:10] (03PS2) 10JMeybohm: dragonfly::dfdaemon: Enable by default when profile is included [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) [09:23:18] (03CR) 10JMeybohm: dragonfly::dfdaemon: Enable by default when profile is included (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:26:19] !log brouberol@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:27:04] (03PS1) 10Alexandros Kosiaris: ats: Route rest_v1/page/(html|title) to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) [09:28:57] (03CR) 10CI reject: [V:04-1] ats: Route rest_v1/page/(html|title) to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [09:30:30] (03CR) 10Elukey: "Left a nit, lemme know! In case you can proceed anyway if you don't feel it useful." 
[puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:33:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P69909 and previous config saved to /var/cache/conftool/dbconfig/20241015-093359-arnaudb.json [09:41:16] (03CR) 10Clément Goubert: [C:03+1] containerd: Remove container log line length limit [puppet] - 10https://gerrit.wikimedia.org/r/1080071 (https://phabricator.wikimedia.org/T377132) (owner: 10JMeybohm) [09:46:14] (03CR) 10Elukey: [C:03+1] dragonfly::dfdaemon: Refactor docker integration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:47:28] (03PS1) 10Ayounsi: Add reports for baremetal servers on legacy vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 [09:47:28] (03PS2) 10Tiziano Fogli: kafka: remove mirror maker alerts from icinga [puppet] - 10https://gerrit.wikimedia.org/r/1078456 (https://phabricator.wikimedia.org/T370153) [09:48:27] (03CR) 10Ayounsi: "Currently live on Netbox next: https://netbox-next.wikimedia.org/extras/scripts/38/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [09:48:47] (03PS3) 10Tiziano Fogli: kafka: remove mirror maker alerts from icinga [puppet] - 10https://gerrit.wikimedia.org/r/1078456 (https://phabricator.wikimedia.org/T370153) [09:49:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P69910 and previous config saved to /var/cache/conftool/dbconfig/20241015-094906-arnaudb.json [09:49:08] (03CR) 10CI reject: [V:04-1] Add reports for baremetal servers on legacy vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [09:52:24] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[09:52:32] (03CR) 10JMeybohm: [C:03+2] dragonfly::dfdaemon: Refactor docker integration [puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:52:33] (03PS1) 10Ayounsi: Netbox: better logging for scripts import [puppet] - 10https://gerrit.wikimedia.org/r/1080240 [09:52:36] (03CR) 10JMeybohm: [C:03+2] dragonfly::dfdaemon: Enable by default when profile is included [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:52:40] (03PS1) 10Alexandros Kosiaris: rest-gateway: Route page and html RESTBase routes to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080241 (https://phabricator.wikimedia.org/T374683) [09:52:52] (03PS2) 10Ayounsi: Netbox: better logging for scripts import [puppet] - 10https://gerrit.wikimedia.org/r/1080240 [09:52:59] (03CR) 10JMeybohm: [V:03+2 C:03+2] dragonfly::dfdaemon: Refactor docker integration [puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:54:20] (03PS2) 10Ayounsi: Add reports for baremetal servers on legacy vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 [09:54:47] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:55:46] !log brouberol@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:dse-k8s-worker [09:57:03] !log brouberol@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:57:35] (03PS1) 10Gmodena: dse-k8s-services: content_history: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080245 (https://phabricator.wikimedia.org/T368787) [09:59:26] (03PS7) 10JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) [09:59:39] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:59:40] (03CR) 10Gmodena: "I've been testing a dev variant of this image on the -next deployment for a few days now, and it seems stable enough to warrant a release." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080245 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1000) [10:04:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T367781)', diff saved to https://phabricator.wikimedia.org/P69911 and previous config saved to /var/cache/conftool/dbconfig/20241015-100413-arnaudb.json [10:04:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2130.codfw.wmnet with reason: Maintenance [10:04:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:04:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2130.codfw.wmnet with reason: Maintenance [10:04:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T367781)', diff saved to https://phabricator.wikimedia.org/P69912 and previous config saved to /var/cache/conftool/dbconfig/20241015-100435-arnaudb.json [10:06:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T367781)', diff saved to https://phabricator.wikimedia.org/P69913 and previous config saved to 
/var/cache/conftool/dbconfig/20241015-100652-arnaudb.json
[10:08:07] (PS1) STran: Give the `abusefilter-maintainer` group protected vars access [mediawiki-config] - https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610)
[10:08:45] (CR) CI reject: [V:-1] Give the `abusefilter-maintainer` group protected vars access [mediawiki-config] - https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: STran)
[10:11:24] !log brouberol@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker
[10:11:47] (PS1) Brouberol: cloudnative_pg: scope the WAL lag alert on the primary instance [alerts] - https://gerrit.wikimedia.org/r/1080251 (https://phabricator.wikimedia.org/T372284)
[10:12:50] (CR) Hnowlan: [C:+1] "lgtm bar the CI commit nit" [puppet] - https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) (owner: Alexandros Kosiaris)
[10:12:53] (CR) Volans: "question inline" [puppet] - https://gerrit.wikimedia.org/r/1080240 (owner: Ayounsi)
[10:13:08] (CR) Brouberol: [C:+2] cloudnative_pg: scope the WAL lag alert on the primary instance [alerts] - https://gerrit.wikimedia.org/r/1080251 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol)
[10:14:38] !log brouberol@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[10:15:20] (CR) Volans: "+1 for the idea, question inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1080238 (owner: Ayounsi)
[10:18:14] * TheresNoTime looks at Gerrit..
[10:18:57] hrm yeah, timing out here
[10:20:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:20:31] (PS1) Btullis: Reduce the sensitivity of the anycast healthcheck for cephosd/radosgw [puppet] - https://gerrit.wikimedia.org/r/1080253 (https://phabricator.wikimedia.org/T376697)
[10:21:00] !log brouberol@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[10:21:44] (yup, was slow for me for the past 5m, now just timing out)
[10:21:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P69914 and previous config saved to /var/cache/conftool/dbconfig/20241015-102159-arnaudb.json
[10:22:00] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk2001.codfw.wmnet
[10:22:57] (back?)
[10:23:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4306/co" [puppet] - 10https://gerrit.wikimedia.org/r/1080253 (https://phabricator.wikimedia.org/T376697) (owner: 10Btullis) [10:25:29] (03PS1) 10Clément Goubert: Remove obsolete parsoid records [dns] - 10https://gerrit.wikimedia.org/r/1080254 [10:25:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:25:58] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2001.codfw.wmnet [10:26:35] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk2003.codfw.wmnet [10:28:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10228420 (10phaultfinder) [10:29:59] (03CR) 10Cathal Mooney: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1080253 (https://phabricator.wikimedia.org/T376697) (owner: 10Btullis) [10:29:59] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Route page and html RESTBase routes to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080241 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [10:30:26] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2003.codfw.wmnet [10:30:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:32:15] (03PS1) 10Clément Goubert: Remove obsolete parsoid reference in doc [puppet] - 10https://gerrit.wikimedia.org/r/1080255 [10:32:30] (03CR) 10Btullis: [V:03+1 C:03+2] Reduce the sensitivity of the anycast healthcheck for cephosd/radosgw [puppet] - 10https://gerrit.wikimedia.org/r/1080253 (https://phabricator.wikimedia.org/T376697) (owner: 10Btullis) [10:33:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:34:10] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk2002.codfw.wmnet [10:34:19] (03CR) 10Elukey: [C:03+1] Remove obsolete parsoid records [dns] - 10https://gerrit.wikimedia.org/r/1080254 (owner: 10Clément Goubert) [10:35:53] (03PS2) 10Clément Goubert: Remove obsolete parsoid records [dns] - 10https://gerrit.wikimedia.org/r/1080254 (https://phabricator.wikimedia.org/T359387) [10:35:56] RESOLVED: [2x] 
RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:37:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P69915 and previous config saved to /var/cache/conftool/dbconfig/20241015-103706-arnaudb.json [10:38:00] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2002.codfw.wmnet [10:38:38] !log brouberol@cumin1002 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes [10:40:25] (03CR) 10Clément Goubert: [C:03+2] Remove obsolete parsoid records [dns] - 10https://gerrit.wikimedia.org/r/1080254 (https://phabricator.wikimedia.org/T359387) (owner: 10Clément Goubert) [10:41:39] (03CR) 10Clément Goubert: [C:03+2] Remove obsolete parsoid reference in doc [puppet] - 10https://gerrit.wikimedia.org/r/1080255 (owner: 10Clément Goubert) [10:52:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T367781)', diff saved to https://phabricator.wikimedia.org/P69917 and previous config saved to /var/cache/conftool/dbconfig/20241015-105213-arnaudb.json [10:52:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:52:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:52:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:52:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance [10:52:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet 
with reason: Maintenance [10:53:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T367781)', diff saved to https://phabricator.wikimedia.org/P69918 and previous config saved to /var/cache/conftool/dbconfig/20241015-105301-arnaudb.json [10:53:22] !log expand LVs on prometheus instances (k8s-mlserve and k8s-stagin) T377196 [10:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:26] T377196: Disk usage over threshold on some Prometheus instances - https://phabricator.wikimedia.org/T377196 [10:57:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T367781)', diff saved to https://phabricator.wikimedia.org/P69919 and previous config saved to /var/cache/conftool/dbconfig/20241015-105719-arnaudb.json [10:57:23] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:58:28] (03CR) 10Dreamy Jazz: Give the `abusefilter-maintainer` group protected vars access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [11:01:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:01:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:01:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:01:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:01:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T376905)', diff saved to https://phabricator.wikimedia.org/P69920 
and previous config saved to /var/cache/conftool/dbconfig/20241015-110132-ladsgroup.json [11:07:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [11:07:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [11:07:36] (03CR) 10Dreamy Jazz: temp accounts: Make temp accounts known on metawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 10Kosta Harlan) [11:07:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T371742)', diff saved to https://phabricator.wikimedia.org/P69921 and previous config saved to /var/cache/conftool/dbconfig/20241015-110741-ladsgroup.json [11:07:45] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:09:04] (03PS2) 10STran: Give the `abusefilter-maintainer` group protected vars access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) [11:09:32] (03CR) 10STran: Give the `abusefilter-maintainer` group protected vars access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [11:10:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T376905)', diff saved to https://phabricator.wikimedia.org/P69922 and previous config saved to /var/cache/conftool/dbconfig/20241015-111045-ladsgroup.json [11:12:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P69923 and previous config saved to /var/cache/conftool/dbconfig/20241015-111226-arnaudb.json [11:18:01] (03PS1) 10Btullis: Revert "Lower the number of slots that the enwiki dump uses" [puppet] - 
10https://gerrit.wikimedia.org/r/1080265 (https://phabricator.wikimedia.org/T373904) [11:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:06] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10228569 (10aborrero) 05In progress→03Resolved I think we can consider this to be completed. We may reopen if required. [11:25:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P69924 and previous config saved to /var/cache/conftool/dbconfig/20241015-112551-ladsgroup.json [11:27:00] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: forward all VRF traffic without restrictions for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1080267 (https://phabricator.wikimedia.org/T374714) [11:27:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P69925 and previous config saved to /var/cache/conftool/dbconfig/20241015-112733-arnaudb.json [11:27:38] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: forward all VRF traffic without restrictions for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1080267 (https://phabricator.wikimedia.org/T374714) [11:28:23] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1080267 (https://phabricator.wikimedia.org/T374714) (owner: 10Arturo Borrero Gonzalez) [11:28:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T371742)', diff saved to https://phabricator.wikimedia.org/P69926 and previous config saved to 
/var/cache/conftool/dbconfig/20241015-112829-ladsgroup.json [11:28:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:28:59] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1080267 (https://phabricator.wikimedia.org/T374714) (owner: 10Arturo Borrero Gonzalez) [11:32:38] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: forward all VRF traffic without restrictions for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1080267 (https://phabricator.wikimedia.org/T374714) (owner: 10Arturo Borrero Gonzalez) [11:34:45] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:39:45] (03CR) 10Btullis: [C:03+2] Remove the dumps_store_load_average icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1079971 (https://phabricator.wikimedia.org/T374821) (owner: 10Btullis) [11:40:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P69927 and previous config saved to /var/cache/conftool/dbconfig/20241015-114059-ladsgroup.json [11:42:03] (03CR) 10Dreamy Jazz: Give the `abusefilter-maintainer` group protected vars access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [11:42:15] (03PS3) 10Clément Goubert: Remove no longer used parsoid and api certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [11:42:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T367781)', diff saved to https://phabricator.wikimedia.org/P69929 and previous config saved to /var/cache/conftool/dbconfig/20241015-114240-arnaudb.json [11:42:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2146.codfw.wmnet with 
reason: Maintenance [11:42:44] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove no longer used parsoid and api certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [11:42:44] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:42:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [11:43:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T367781)', diff saved to https://phabricator.wikimedia.org/P69930 and previous config saved to /var/cache/conftool/dbconfig/20241015-114302-arnaudb.json [11:43:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69931 and previous config saved to /var/cache/conftool/dbconfig/20241015-114336-ladsgroup.json [11:44:22] (03CR) 10Alexandros Kosiaris: [C:03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080241 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [11:44:26] (03CR) 10Alexandros Kosiaris: [C:03+2] rest-gateway: Route page and html RESTBase routes to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080241 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [11:45:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T367781)', diff saved to https://phabricator.wikimedia.org/P69932 and previous config saved to /var/cache/conftool/dbconfig/20241015-114518-arnaudb.json [11:45:30] (03Merged) 10jenkins-bot: rest-gateway: Route page and html RESTBase routes to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080241 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [11:48:39] (03PS5) 10Tiziano Fogli: kafka: port mirror maker alerts to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [11:50:48] (03CR) 10Tiziano Fogli: kafka: port mirror maker alerts to alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [11:56:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T376905)', diff saved to https://phabricator.wikimedia.org/P69933 and previous config saved to /var/cache/conftool/dbconfig/20241015-115606-ladsgroup.json [11:56:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:56:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:56:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T376905)', diff saved to https://phabricator.wikimedia.org/P69934 and previous config saved to 
/var/cache/conftool/dbconfig/20241015-115630-ladsgroup.json [11:56:50] (03PS2) 10Sergio Gimeno: GrowthExperiments: update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [11:56:54] (03CR) 10Sergio Gimeno: [C:03+1] GrowthExperiments: update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [11:57:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [11:58:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69935 and previous config saved to /var/cache/conftool/dbconfig/20241015-115842-ladsgroup.json [11:58:52] (03PS1) 10Kevin Bazira: ml-services: enable transparent proxy for article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080268 (https://phabricator.wikimedia.org/T371897) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1200) [12:00:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P69936 and previous config saved to /var/cache/conftool/dbconfig/20241015-120025-arnaudb.json [12:02:00] (03CR) 10Ayounsi: Netbox: better logging for scripts import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080240 (owner: 10Ayounsi) [12:03:44] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.presto.reboot-workers (exit_code=99) for Presto an-presto cluster: Reboot Presto nodes [12:04:47] (03PS3) 10STran: Give the 
`abusefilter-maintainer` group protected vars access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) [12:06:40] (03PS1) 10Michael Große: growthexperiments.pp: track dangling records for fr+eswiki hourly [puppet] - 10https://gerrit.wikimedia.org/r/1080270 (https://phabricator.wikimedia.org/T372337) [12:06:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T376905)', diff saved to https://phabricator.wikimedia.org/P69937 and previous config saved to /var/cache/conftool/dbconfig/20241015-120642-ladsgroup.json [12:07:53] (03CR) 10Ayounsi: Add reports for baremetal servers on legacy vlans (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [12:13:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T371742)', diff saved to https://phabricator.wikimedia.org/P69938 and previous config saved to /var/cache/conftool/dbconfig/20241015-121349-ladsgroup.json [12:13:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:14:10] (03PS2) 10Kosta Harlan: temp accounts: Make temp accounts known on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) [12:14:11] (03CR) 10Kosta Harlan: temp accounts: Make temp accounts known on metawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 10Kosta Harlan) [12:15:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P69939 and previous config saved to /var/cache/conftool/dbconfig/20241015-121532-arnaudb.json [12:16:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [12:17:00] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [12:17:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T371742)', diff saved to https://phabricator.wikimedia.org/P69940 and previous config saved to /var/cache/conftool/dbconfig/20241015-121706-ladsgroup.json [12:18:21] (03CR) 10Urbanecm: [C:03+1] "I don't quite like introducing two jobs here, but seems like useful enough to warrant that. LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1080270 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [12:19:12] if anyone can provide puppet-feedback on ^^^, would be appreciated [12:21:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P69941 and previous config saved to /var/cache/conftool/dbconfig/20241015-122149-ladsgroup.json [12:22:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:22:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:22:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T370903)', diff saved to https://phabricator.wikimedia.org/P69942 and previous config saved to /var/cache/conftool/dbconfig/20241015-122251-ladsgroup.json [12:22:56] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:23:08] (03PS4) 10STran: Apply wmf-specific protected vars rights access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) [12:23:52] (03PS1) 10Kosta Harlan: dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) [12:24:04] !log 
akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:24:18] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:24:23] (03CR) 10STran: Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [12:24:25] (03CR) 10CI reject: [V:04-1] dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [12:26:00] (03PS6) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [12:26:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T370903)', diff saved to https://phabricator.wikimedia.org/P69943 and previous config saved to /var/cache/conftool/dbconfig/20241015-122601-ladsgroup.json [12:26:06] (03CR) 10Kosta Harlan: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [12:26:35] (03CR) 10CI reject: [V:04-1] dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [12:29:56] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:30:04] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:30:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T367781)', diff saved to https://phabricator.wikimedia.org/P69944 and previous config saved to /var/cache/conftool/dbconfig/20241015-123039-arnaudb.json [12:30:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:30:43] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:30:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:31:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T367781)', diff saved to https://phabricator.wikimedia.org/P69945 and previous config saved to /var/cache/conftool/dbconfig/20241015-123101-arnaudb.json [12:31:33] (03PS7) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [12:32:46] (03PS2) 10Kosta Harlan: dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) [12:33:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T367781)', diff saved to https://phabricator.wikimedia.org/P69946 and previous config saved to /var/cache/conftool/dbconfig/20241015-123318-arnaudb.json [12:33:32] 
(03CR) 10Brouberol: [C:03+1] Revert "Lower the number of slots that the enwiki dump uses" [puppet] - 10https://gerrit.wikimedia.org/r/1080265 (https://phabricator.wikimedia.org/T373904) (owner: 10Btullis) [12:35:53] (03CR) 10Hashar: "Excellent!!" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [12:36:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P69947 and previous config saved to /var/cache/conftool/dbconfig/20241015-123656-ladsgroup.json [12:39:04] (03PS1) 10Brouberol: airflo-analytics-test: define admin LDAP group -> role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080273 (https://phabricator.wikimedia.org/T374948) [12:39:50] (03PS2) 10Brouberol: airflow-analytics-test: define admin LDAP group -> role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080273 (https://phabricator.wikimedia.org/T374948) [12:41:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69948 and previous config saved to /var/cache/conftool/dbconfig/20241015-124108-ladsgroup.json [12:46:12] !log brouberol@cumin1002 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes [12:46:23] (03CR) 10Xcollazo: [C:03+1] Revert "Lower the number of slots that the enwiki dump uses" [puppet] - 10https://gerrit.wikimedia.org/r/1080265 (https://phabricator.wikimedia.org/T373904) (owner: 10Btullis) [12:47:51] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-test: define admin LDAP group -> role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080273 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [12:48:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to 
https://phabricator.wikimedia.org/P69949 and previous config saved to /var/cache/conftool/dbconfig/20241015-124825-arnaudb.json [12:50:44] (03CR) 10CDanis: [C:03+1] lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1072247 (owner: 10EoghanGaffney) [12:50:49] !log destroy old certs from puppetmaster1001's CA (parsoid.svc.{eqiad,codfw}.wmnet, debmonitor.discovery.wmnet) [12:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:53] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.presto.reboot-workers (exit_code=99) for Presto an-presto cluster: Reboot Presto nodes [12:52:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T376905)', diff saved to https://phabricator.wikimedia.org/P69950 and previous config saved to /var/cache/conftool/dbconfig/20241015-125203-ladsgroup.json [12:52:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:52:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:56:12] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: enable transparent proxy for article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080268 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [12:56:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69951 and previous config saved to /var/cache/conftool/dbconfig/20241015-125615-ladsgroup.json [12:56:24] (03CR) 10Jforrester: [C:04-1] "Hmm. Will try to debug further, then." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1079326 (owner: 10Jforrester) [12:57:24] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1016.eqiad.wmnet [12:57:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:57:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:57:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T376905)', diff saved to https://phabricator.wikimedia.org/P69952 and previous config saved to /var/cache/conftool/dbconfig/20241015-125748-ladsgroup.json [12:58:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [12:58:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080275 (https://phabricator.wikimedia.org/T128546) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1300). [13:00:05] Daimona, jan_drewniak, sergi0, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] 👋 [13:00:21] o/ [13:00:21] o/ [13:00:26] I can deploy today! 
[13:00:27] o/ [13:00:31] o/ [13:01:10] (03PS3) 10Daimona Eaytoy: [wikidatawiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 (https://phabricator.wikimedia.org/T375411) [13:01:12] (03CR) 10Urbanecm: [C:03+2] [wikidatawiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 (https://phabricator.wikimedia.org/T375411) (owner: 10Daimona Eaytoy) [13:01:39] (03PS3) 10Sergio Gimeno: GrowthExperiments: update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [13:01:40] (03CR) 10Urbanecm: [C:03+2] GrowthExperiments: update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [13:01:55] (03Merged) 10jenkins-bot: [wikidatawiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 (https://phabricator.wikimedia.org/T375411) (owner: 10Daimona Eaytoy) [13:02:13] o/ [13:02:23] (03Merged) 10jenkins-bot: GrowthExperiments: update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: 10Cyndywikime) [13:03:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P69953 and previous config saved to /var/cache/conftool/dbconfig/20241015-130332-arnaudb.json [13:03:39] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1016.eqiad.wmnet [13:04:08] urbanecm: I can deploy my portals patch myself when you're done [13:04:09] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1079521|[wikidatawiki] Enable the CampaignEvents extension (T375411)]], 
[[gerrit:1079475|GrowthExperiments: update stream configuration to capture user id (T376833)]] [13:04:14] T375411: Enable the CampaignEvents extension on Wikidata [target: Oct 15] - https://phabricator.wikimedia.org/T375411 [13:04:15] T376833: Update stream configuration to capture user id - https://phabricator.wikimedia.org/T376833 [13:04:18] jan_drewniak: sounds good, will ping you at the end then [13:04:52] (03PS1) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:04:58] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1017.eqiad.wmnet [13:05:28] (03CR) 10Jforrester: [C:03+1] Reduce number of bucketsizes for MediaViewer (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079640 (https://phabricator.wikimedia.org/T372165) (owner: 10Simon04) [13:06:47] (03CR) 10CI reject: [V:04-1] Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [13:06:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T376905)', diff saved to https://phabricator.wikimedia.org/P69954 and previous config saved to /var/cache/conftool/dbconfig/20241015-130647-ladsgroup.json [13:07:15] @Daimona I think we're good [13:08:32] it's still pulling changes to debug [13:09:01] (03PS2) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:09:03] hi, can i add a patch to the window still? [13:09:27] MatmaRex: sure! 
[13:09:31] add it to the calendar please [13:10:22] (03CR) 10Urbanecm: [C:04-1] Apply wmf-specific protected vars rights access (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:10:57] (03CR) 10CI reject: [V:04-1] Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [13:11:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786) (owner: 10Mhorsey) [13:11:04] (03PS1) 10Bartosz Dziewoński: SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080279 (https://phabricator.wikimedia.org/T45646) [13:11:06] !log urbanecm@deploy2002 cyndywikime, daimona, urbanecm: Backport for [[gerrit:1079521|[wikidatawiki] Enable the CampaignEvents extension (T375411)]], [[gerrit:1079475|GrowthExperiments: update stream configuration to capture user id (T376833)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:11] finally [13:11:11] T375411: Enable the CampaignEvents extension on Wikidata [target: Oct 15] - https://phabricator.wikimedia.org/T375411 [13:11:12] T376833: Update stream configuration to capture user id - https://phabricator.wikimedia.org/T376833 [13:11:12] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1017.eqiad.wmnet [13:11:19] Daimona: sergi0: can you test your changes, please? 
[13:11:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T370903)', diff saved to https://phabricator.wikimedia.org/P69955 and previous config saved to /var/cache/conftool/dbconfig/20241015-131122-ladsgroup.json [13:11:26] sure [13:11:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:02] (03CR) 10Urbanecm: [C:03+2] SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080279 (https://phabricator.wikimedia.org/T45646) (owner: 10Bartosz Dziewoński) [13:12:33] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1018.eqiad.wmnet [13:12:35] added. 
thanks [13:12:45] (03CR) 10Urbanecm: [C:04-1] Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:12:51] (03CR) 10Dreamy Jazz: Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:13:07] (03PS3) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:13:21] (03CR) 10Dreamy Jazz: Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:13:33] So, wikidata looks good. I should point out that the patch also creates the event-organizer group on testwikidata, where the extension is not enabled. That should be harmless though, and we might consider enabling the extension on testwikidata later. [13:14:38] (03CR) 10Urbanecm: [C:04-1] Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:15:09] All good from my end [13:15:31] Daimona: right, it changes `wikidata` group, not wikidatawiki. seems like not harmful, so we can proceed (and if needed, it can be fixed later). sounds good? [13:15:34] (03CR) 10Dreamy Jazz: Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:16:09] yep, totally! I thought it'd make sense to keep the same config for wikidata and testwikidata regardless. [13:16:42] generally, yeah, but when the extension is not enabled... 
[13:16:44] anyway, proceeding [13:16:46] thanks sergi0 for confirming [13:16:47] !log urbanecm@deploy2002 cyndywikime, daimona, urbanecm: Continuing with sync [13:17:05] (03PS2) 10Michael Große: eswiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) [13:17:09] (03CR) 10Urbanecm: [C:03+2] eswiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [13:18:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T367781)', diff saved to https://phabricator.wikimedia.org/P69956 and previous config saved to /var/cache/conftool/dbconfig/20241015-131839-arnaudb.json [13:18:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:18:51] (03Merged) 10jenkins-bot: eswiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [13:18:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:18:57] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:19:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T367781)', diff saved to https://phabricator.wikimedia.org/P69957 and previous config saved to /var/cache/conftool/dbconfig/20241015-131901-arnaudb.json [13:19:06] I don't think my config change can be meaningfully tested beyond checking for errors. 
[13:19:10] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1018.eqiad.wmnet [13:19:11] agreed [13:19:16] We'll see its effects in our tracking over time. [13:19:31] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges - https://phabricator.wikimedia.org/T376291#10229001 (10cmooney) The above patch is my current best-stab at accomplishing this. I won't have a huge amount of time to look... [13:19:32] i'd want to ship it with the other change, but CI says 25m eta [13:19:37] so i'll just sync it alone [13:20:38] (03PS4) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:21:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T367781)', diff saved to https://phabricator.wikimedia.org/P69958 and previous config saved to /var/cache/conftool/dbconfig/20241015-132117-arnaudb.json [13:21:18] urbanecm: is it fine to sneak in a config change? [13:21:27] zabe: sure, which one? 
[13:21:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1079520 [13:21:41] (sure as in "happy to deploy it for you", not "feel free to take over deploy2002") [13:21:43] (03CR) 10Dreamy Jazz: Apply wmf-specific protected vars rights access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:21:44] I can do it myself if you want [13:21:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P69959 and previous config saved to /var/cache/conftool/dbconfig/20241015-132154-ladsgroup.json [13:22:26] zabe: i'd prefer sneaking it with other changes i'm deploying, to save time [13:22:37] alright [13:22:42] then feel free:) [13:22:45] (03PS2) 10Zabe: s7: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079520 (https://phabricator.wikimedia.org/T183490) [13:22:48] (03CR) 10Urbanecm: [C:03+2] s7: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079520 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [13:23:07] (03PS5) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:23:35] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079521|[wikidatawiki] Enable the CampaignEvents extension (T375411)]], [[gerrit:1079475|GrowthExperiments: update stream configuration to capture user id (T376833)]] (duration: 19m 25s) [13:23:40] T375411: Enable the CampaignEvents extension on Wikidata [target: Oct 15] - https://phabricator.wikimedia.org/T375411 [13:23:40] T376833: Update stream configuration to capture user id - https://phabricator.wikimedia.org/T376833 [13:23:44] (03Merged) 10jenkins-bot: s7: Reduce revision-slots cache expiry 
to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079520 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [13:24:23] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1080035|eswiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]], [[gerrit:1079520|s7: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [13:24:33] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [13:24:33] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [13:26:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:38] !log urbanecm@deploy2002 migr, urbanecm, zabe: Backport for [[gerrit:1080035|eswiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]], [[gerrit:1079520|s7: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:50] zabe: if you want to do something while at debug [13:26:52] otherwise i can sync [13:27:16] no, it's only changing the ttl, not really testable [13:27:30] !log urbanecm@deploy2002 migr, urbanecm, zabe: Continuing with sync [13:27:36] thought so. 
proceeding [13:27:56] (03CR) 10Btullis: [C:03+2] Revert "Lower the number of slots that the enwiki dump uses" [puppet] - 10https://gerrit.wikimedia.org/r/1080265 (https://phabricator.wikimedia.org/T373904) (owner: 10Btullis) [13:29:40] (03PS6) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:31:20] (03CR) 10Kevin Bazira: [C:03+2] ml-services: enable transparent proxy for article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080268 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [13:32:07] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080035|eswiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]], [[gerrit:1079520|s7: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 07m 44s) [13:32:16] zabe: MichaelG_WMF: done [13:32:22] thanks :) [13:32:23] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [13:32:23] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [13:32:28] Thanks! [13:32:57] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [13:33:09] np [13:33:18] still eta 10m... 
[13:34:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080279 (https://phabricator.wikimedia.org/T45646) (owner: 10Bartosz Dziewoński) [13:34:08] (03Merged) 10jenkins-bot: ml-services: enable transparent proxy for article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080268 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [13:36:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P69960 and previous config saved to /var/cache/conftool/dbconfig/20241015-133624-arnaudb.json [13:37:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P69961 and previous config saved to /var/cache/conftool/dbconfig/20241015-133701-ladsgroup.json [13:40:18] (03PS7) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) [13:45:37] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [13:48:31] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 06Traffic: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229170 (10Bugreporter) [13:48:34] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [13:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10229171 (10phaultfinder) [13:50:23] (03Merged) 10jenkins-bot: SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080279 (https://phabricator.wikimedia.org/T45646) (owner: 
10Bartosz Dziewoński) [13:50:50] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1080279|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] [13:51:05] T45646: "MediaWiki:Copyright" message allows raw HTML - https://phabricator.wikimedia.org/T45646 [13:51:18] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1019.eqiad.wmnet [13:51:23] MatmaRex: finally [13:51:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P69962 and previous config saved to /var/cache/conftool/dbconfig/20241015-135131-arnaudb.json [13:52:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T376905)', diff saved to https://phabricator.wikimedia.org/P69963 and previous config saved to /var/cache/conftool/dbconfig/20241015-135208-ladsgroup.json [13:52:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T371742)', diff saved to https://phabricator.wikimedia.org/P69964 and previous config saved to /var/cache/conftool/dbconfig/20241015-135213-ladsgroup.json [13:52:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:52:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:52:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T376905)', diff saved to https://phabricator.wikimedia.org/P69965 and previous config saved to /var/cache/conftool/dbconfig/20241015-135234-ladsgroup.json [13:53:01] !log urbanecm@deploy2002 urbanecm, matmarex: Backport for [[gerrit:1080279|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:53:01] T371742: Change 
page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:53:16] urbanecm: thanks. looking [13:55:56] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [13:56:47] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: define admin LDAP group -> role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080273 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [13:57:32] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1019.eqiad.wmnet [13:58:09] sorry, still testing [13:58:25] i'm not sure if it's working as it should, eh [14:00:46] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1020.eqiad.wmnet [14:00:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T376905)', diff saved to https://phabricator.wikimedia.org/P69966 and previous config saved to /var/cache/conftool/dbconfig/20241015-140045-ladsgroup.json [14:01:19] urbanecm: i think this is fine to deploy, it doesn't break anything, but it doesn't fix my issue either. there's some weird interaction with a hook on wikimedia wikis [14:04:19] 10SRE-swift-storage, 06Commons, 06Traffic: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229245 (10Aklapper) Removing #mediawiki-file-management as I doubt that this is a bug in MediaWiki core code. I get a `404` error here (Central Europe). 
[14:04:21] either that, or i was confused by some caching [14:04:34] because now it seems to work as i wanted, after i tested on another page [14:05:10] !log btullis@cumin1002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema [14:06:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T367781)', diff saved to https://phabricator.wikimedia.org/P69967 and previous config saved to /var/cache/conftool/dbconfig/20241015-140638-arnaudb.json [14:06:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:06:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:06:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:07:01] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1020.eqiad.wmnet [14:07:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:07:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:07:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T367781)', diff saved to https://phabricator.wikimedia.org/P69968 and previous config saved to /var/cache/conftool/dbconfig/20241015-140716-arnaudb.json [14:07:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P69969 and previous config saved to /var/cache/conftool/dbconfig/20241015-140726-ladsgroup.json [14:07:59] (03CR) 10Dreamy Jazz: [C:03+1] temp accounts: Make temp accounts known on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 
10Kosta Harlan) [14:08:58] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [14:09:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T367781)', diff saved to https://phabricator.wikimedia.org/P69970 and previous config saved to /var/cache/conftool/dbconfig/20241015-140932-arnaudb.json [14:10:28] urbanecm: still there? [14:15:10] (03PS1) 10Brouberol: cloudnative_pg: add a cookbook to investigate wal archive issues [alerts] - 10https://gerrit.wikimedia.org/r/1080293 (https://phabricator.wikimedia.org/T372284) [14:15:24] (03CR) 10Clément Goubert: [C:03+2] Remove no longer used parsoid and api certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [14:15:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P69971 and previous config saved to /var/cache/conftool/dbconfig/20241015-141552-ladsgroup.json [14:15:56] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [14:16:13] hey MatmaRex, I haven't been paying much attention to the backport window, but I still have to deploy a portals change. Do you need a change to be deployed still? [14:16:33] (03CR) 10Brouberol: [C:03+2] cloudnative_pg: add a cookbook to investigate wal archive issues [alerts] - 10https://gerrit.wikimedia.org/r/1080293 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [14:17:02] jan_drewniak: yeah, my backport has been synced to test servers by urbanecm, but then something interrupted him [14:17:38] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [14:18:13] MatmaRex: ok, I can take over from there then. It's good to sync? 
[14:18:46] jan_drewniak: yep [14:19:36] MatmaRex: sorry, i was waiting on ci and got refocused [14:19:38] !log urbanecm@deploy2002 urbanecm, matmarex: Continuing with sync [14:19:40] proceeding [14:20:06] jan_drewniak: once scap finishes, feel free to take over [14:20:09] thanks urbanecm, looks like I wouldn't be able to help anyway: `14:18:51 backport is locked by urbanecm (pid 1405129) on Tue Oct 15 13:33:59 2024` [14:21:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema [14:22:28] (03CR) 10JHathaway: [C:03+2] realm.pp: drop namservers global as it is no longer used [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:22:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P69972 and previous config saved to /var/cache/conftool/dbconfig/20241015-142233-ladsgroup.json [14:24:13] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080279|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] (duration: 33m 23s) [14:24:25] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [14:24:28] T45646: "MediaWiki:Copyright" message allows raw HTML - https://phabricator.wikimedia.org/T45646 [14:24:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P69973 and previous config saved to /var/cache/conftool/dbconfig/20241015-142439-arnaudb.json [14:26:29] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:26:37] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:26:42] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:26:55] !log 
akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:26:58] (03PS1) 10Clément Goubert: Remove obsolete api records [dns] - 10https://gerrit.wikimedia.org/r/1080295 [14:26:58] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [14:27:16] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:31:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P69974 and previous config saved to /var/cache/conftool/dbconfig/20241015-143059-ladsgroup.json [14:31:44] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet [14:33:00] MatmaRex: done [14:33:03] (03PS1) 10Alexandros Kosiaris: rest-gateway: Skip project for /w/rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080301 (https://phabricator.wikimedia.org/T374683) [14:33:08] jan_drewniak: feel free to take over, sorry to delay [14:33:19] urbanecm: no problem! [14:33:30] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [14:33:43] thanks urbanecm! 
[14:33:44] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:34:47] (03CR) 10Alexandros Kosiaris: [C:03+2] rest-gateway: Skip project for /w/rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080301 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:34:51] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:35:24] (03PS2) 10Alexandros Kosiaris: ats: Route rest_v1/page/(html|title) to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1080232 (https://phabricator.wikimedia.org/T374683) [14:35:33] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Skip project for /w/rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080301 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:35:39] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet [14:35:52] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [14:35:53] (03Merged) 10jenkins-bot: rest-gateway: Skip project for /w/rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080301 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:37:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T371742)', diff saved to https://phabricator.wikimedia.org/P69975 and previous config saved to 
/var/cache/conftool/dbconfig/20241015-143740-ladsgroup.json [14:37:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [14:37:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [14:38:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T371742)', diff saved to https://phabricator.wikimedia.org/P69976 and previous config saved to /var/cache/conftool/dbconfig/20241015-143803-ladsgroup.json [14:38:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:39:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P69977 and previous config saved to /var/cache/conftool/dbconfig/20241015-143946-arnaudb.json [14:40:08] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1080240 (owner: 10Ayounsi) [14:43:17] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 06m 46s) [14:43:49] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [14:44:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:58] jouncebot: nowandnext [14:44:58] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [14:44:58] In 0 hour(s) and 15 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1500) [14:45:15] I'm not deploying 
anything, just doing switchover of s3 [14:45:42] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 02m 24s) [14:46:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T376905)', diff saved to https://phabricator.wikimedia.org/P69978 and previous config saved to /var/cache/conftool/dbconfig/20241015-144606-ladsgroup.json [14:46:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:46:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:46:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T376905)', diff saved to https://phabricator.wikimedia.org/P69979 and previous config saved to /var/cache/conftool/dbconfig/20241015-144631-ladsgroup.json [14:46:54] (03CR) 10Volans: [C:03+1] "I haven't tested it but looks sane." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [14:47:40] (03Abandoned) 10Tiziano Fogli: logstash: stripping containerd prefix [puppet] - 10https://gerrit.wikimedia.org/r/1080047 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [14:47:51] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [14:48:27] !log herron@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [14:49:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:14] (03CR) 10JMeybohm: "Should be a noop apart from the nrpe_check_disk_options" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:52:02] (03CR) 10Ahmon Dancy: [C:03+1] hieradata: convert remaining mw_releases entries [puppet] - 10https://gerrit.wikimedia.org/r/1077482 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [14:54:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T376905)', diff saved to https://phabricator.wikimedia.org/P69980 and previous config saved to /var/cache/conftool/dbconfig/20241015-145441-ladsgroup.json [14:54:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T367781)', diff saved to https://phabricator.wikimedia.org/P69981 and previous config saved to /var/cache/conftool/dbconfig/20241015-145453-arnaudb.json [14:54:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [14:55:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2174.codfw.wmnet with reason: Maintenance [14:55:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T367781)', diff saved to https://phabricator.wikimedia.org/P69982 and previous config saved to /var/cache/conftool/dbconfig/20241015-145517-arnaudb.json [14:55:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:56:35] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [14:57:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T367781)', diff saved to https://phabricator.wikimedia.org/P69983 and previous config saved to /var/cache/conftool/dbconfig/20241015-145734-arnaudb.json [15:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a SRE Collaboration Services office hours deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1500). [15:01:12] FIRING: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:28] * volans looking [15:01:31] !incidents [15:01:32] 5321 (UNACKED) [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad) [15:01:36] !ack 5321 [15:01:37] 5321 (ACKED) [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad) [15:01:51] anyone working on aux-k8s-ctrl1003? 
[15:02:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:35] I'm unable to connect, checking remote console [15:02:48] (03CR) 10Elukey: [C:03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [15:03:18] it's a VM, isn't it? [15:03:30] what's up? [15:03:34] or not-up as the case may be [15:03:50] akosiaris: yep, just realized [15:04:08] FIRING: KubernetesCalicoDown: aux-k8s-ctrl1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-ctrl1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:04:08] it is a VM yes, I added it some days ago [15:04:10] 1002 is up btw, just fine. no degradation of service for now [15:04:13] I just lost access to a ganeti VM, in case that's useful [15:04:21] akosiaris: do we need to depool it or something? [15:04:23] or is automatic [15:04:27] cloudcumin1001.eqiad.wmnet [15:04:28] volans: it's LVS [15:04:36] oh, maybe we are losing a ganeti node? 
[15:04:57] both in ganeti D [15:05:00] in eqiad [15:05:48] I'm trying to see the cluster status, master is ganeti1028 [15:05:55] so far I'm unable to list vms [15:06:03] yeah, same [15:06:08] seems on ganeti1032.eqiad.wmnet [15:06:16] it is slow but it works [15:06:17] it took very long [15:06:22] now I got the answer [15:06:39] aux-k8s-ctrl1003.eqiad.wmnet kvm debootstrap+default ganeti1034.eqiad.wmnet running 4.0G [15:06:40] cloudcumin1001.eqiad.wmnet kvm debootstrap+default ganeti1034.eqiad.wmnet running 1.0G [15:06:43] same node for both [15:06:46] to be clear, this is non-impacting (yet) for wiki users, right? [15:07:01] it is not impacting [15:07:05] ok thanks! [15:07:13] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:14] I got the ganeti node wrong then [15:07:18] :( [15:07:44] node doesn't feel well [15:08:01] but I can't quantify it [15:08:10] cpu is low, memory is ok [15:08:15] oh, maybe network [15:09:10] drbd [15:09:19] We did not send a P_BARRIER for 600972ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked? 
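For reference, the threshold quoted in the DRBD kernel message above can be checked by hand. This is a minimal sketch, assuming only what the message itself states: ko-count is 7 and the timeout of 60 is expressed in units of 0.1 s. The function name `drbd_barrier_threshold_ms` is hypothetical and not part of DRBD or any Wikimedia tooling.

```python
def drbd_barrier_threshold_ms(ko_count: int, timeout_tenths: int) -> int:
    """Return ko-count * timeout in milliseconds (timeout is given in 0.1 s units)."""
    return ko_count * timeout_tenths * 100  # one tenth of a second is 100 ms


# Values taken from the kernel message in the log.
threshold_ms = drbd_barrier_threshold_ms(ko_count=7, timeout_tenths=60)
elapsed_ms = 600972

print(threshold_ms)               # 42000, i.e. 42 seconds
print(elapsed_ms > threshold_ms)  # True: the barrier was overdue by roughly ten minutes
```

The roughly 601 s of blocked time lines up with the impression in the channel that the node had been degraded for several minutes before anyone could list VMs.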
[15:09:30] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10229749 (10Jhancock.wm) [15:09:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P69984 and previous config saved to /var/cache/conftool/dbconfig/20241015-150948-ladsgroup.json [15:10:06] (03PS1) 10Papaul: Update frack switches ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/1080311 (https://phabricator.wikimedia.org/T374587) [15:10:37] yeah, network seems ok but traffic has gone down to unhealthy levels for ganeti node with 19 VMs https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-1h&to=now&var-server=ganeti1034&var-datasource=thanos&var-cluster=ganeti&viewPanel=8 [15:10:45] maybe disk failures? [15:11:01] oh wait, md2 is resyncing [15:11:09] at a glacial speed [15:11:21] akosiaris: various disks have NetworkFailure in /proc/drbd [15:11:24] FIRING: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:24] first entry is at Oct 15 14:58:34 ganeti1034 kernel: drbd resource2: meta connection shut down by peer. 
[15:12:33] (03CR) 10BBlack: [C:03+1] varnish: add pediapress.com to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: 10Ssingh) [15:12:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P69985 and previous config saved to /var/cache/conftool/dbconfig/20241015-151243-arnaudb.json [15:13:36] volans: yeah, I see md2_resync being blocked for large amounts of time, drbd resources failing [15:13:42] this isn't looking good [15:14:18] I think I wanna try and empty it from as many VMs as possible now that there is still a chance [15:14:20] objections? [15:14:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:50] akosiaris: +1, we have sre.ganeti.drain-node IIRC [15:15:12] if we're lucky that there aren't no-drbd instances on it [15:15:25] if that's the same drbd failure mode as T348730 then you will probably have to forcibly reboot the metal [15:15:26] T348730: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730 [15:16:59] !log akosiaris@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [15:17:19] !log drain ganeti1034 of VMs, hardware might be misbehaving [15:17:20] cdanis: it looks similar enough at first sight, but I didn't check all the call traces [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:24] (03CR) 10Papaul: [C:03+2] Update frack switches ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/1080311 (https://phabricator.wikimedia.org/T374587) (owner: 10Papaul) [15:17:58] volans: yeah just saying, there's an argument to be made for just rebooting 
via mgmt interface immediately, nothing else is likely to work, and it's also likely all the VMs hosted there are basically hung right now and can't be drained [15:18:09] yeah I was wondering the same [15:18:12] FYI very recently I created a prometheus script to alert on kernel panics/taints, see modules/prometheus/files/usr/local/bin/prometheus-node-kernel-panic.sh and https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=All [15:18:12] akosiaris: ^^^ [15:18:45] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:19:06] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1034.eqiad.wmnet [15:19:14] well, it failed [15:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:01] I can force a reboot for sure [15:20:11] akosiaris: I'm already in the mgmt of ganeti1034 if you want [15:20:23] just noting VMs [15:20:39] *if you want me to do it [15:20:56] it's aphlict1002.eqiad.wmnet, acmechief1002.eqiad.wmnet, schema1004.eqiad.wmnet, cloudcumin1001.eqiad.wmnet, aux-k8s-ctrl1003.eqiad.wmnet, aux-k8s-etcd1004.eqiad.wmnet [15:20:59] all stuck in D state [15:21:40] volans: go for it [15:22:11] !log force-rebooting ganeti1034 stuck due to drbd traces via mgmt [15:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:14] spicerack.ganeti.GanetiError: Error while performing request to RAPI [15:22:23] timeouts fwiw [15:22:40] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 
06Traffic: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229798 (10MatthewVernon) The problem is that this object has been uploaded to codfw OK, but not eqiad; all being equal, this will get picked up... [15:22:50] yeah probably the master cannot perform the action because the node is not responding to it [15:23:32] seeing boot first things in console [15:23:34] cdanis: good way to test the new aux VMs :D [15:23:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:24:55] akosiaris: rebooted, at login [15:24:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P69986 and previous config saved to /var/cache/conftool/dbconfig/20241015-152456-ladsgroup.json [15:24:59] nothing red during boot process [15:25:51] drbd disks are in cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent state [15:26:09] the ones it was secondary for are just connected back [15:26:11] and up to date [15:26:38] !log run gnt-cluster verify-disks after ganeti1034 forceful reboot [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:46] no disks need to be activated [15:26:48] syncs are around 50% right now [15:26:51] (03PS1) 10Jelto: gerrit: lower nftables_throttling::max_connections to 25 [puppet] - 10https://gerrit.wikimedia.org/r/1080313 (https://phabricator.wikimedia.org/T365259) [15:27:11] all 6 VMs running [15:27:13] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:26] and 
resolutions coming in [15:27:33] nice [15:27:35] arturo: test cloudcumin1001 plz? [15:27:44] akosiaris: I'm inside the VM now [15:27:44] sync almost completed for all primary disks [15:27:45] overall LGTM [15:27:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P69987 and previous config saved to /var/cache/conftool/dbconfig/20241015-152749-arnaudb.json [15:28:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:28:35] akosiaris: LGTM [15:28:49] all disks UpToDate/UpToDate [15:28:50] cool, thanks [15:28:57] sync completed [15:29:08] RESOLVED: KubernetesCalicoDown: aux-k8s-ctrl1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-ctrl1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:17] mdstat seems happy too [15:30:58] I think we can call it resolved for now? 
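The DRBD states quoted after the reboot (cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent, later UpToDate/UpToDate) follow a simple key:local/peer layout. The following is a minimal sketch of how such a status string could be split and checked, not the tooling used during the incident. The helper names are hypothetical, and the second sample string assumes a cs:Connected state, which the log only implies ("just connected back").

```python
def parse_drbd_state(status: str) -> dict:
    """Split 'cs:X ro:A/B ds:C/D' into a dict; slash-separated values become (local, peer) tuples."""
    fields = {}
    for token in status.split():
        key, _, value = token.partition(":")
        fields[key] = tuple(value.split("/")) if "/" in value else value
    return fields


def is_fully_synced(fields: dict) -> bool:
    """A resource is treated as healthy here only when both disk states are UpToDate."""
    return fields.get("ds") == ("UpToDate", "UpToDate")


# State reported at 15:25:51, while the primaries were still resyncing to their peers.
during = parse_drbd_state("cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent")
# Illustrative state once the sync completed (cs value is an assumption).
after = parse_drbd_state("cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate")

print(is_fully_synced(during))
print(is_fully_synced(after))
```

Checking the ds pair rather than the connection state matches how the recovery was judged in the channel: it was called complete only once all disks showed UpToDate/UpToDate.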
[15:31:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: 10Kimberly Sarabia) [15:31:12] RESOLVED: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:17] 06SRE, 06Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#10229827 (10akosiaris) Probably happened once more on ganeti1034 today. node is still bullseye fwiw. [15:33:43] 06SRE, 06Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#10229835 (10Volans) A forced reboot via mgmt seems to have put back in a working state for now. [15:34:29] thanks akosiaris volans [15:35:17] (03CR) 10BryanDavis: Account blocking: Publically available log of all block and unblocks. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [15:38:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:38:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:40:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T376905)', diff saved to https://phabricator.wikimedia.org/P69988 and previous config saved to /var/cache/conftool/dbconfig/20241015-154002-ladsgroup.json [15:40:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:40:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:40:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T376905)', diff saved to https://phabricator.wikimedia.org/P69989 and previous config saved to /var/cache/conftool/dbconfig/20241015-154027-ladsgroup.json [15:41:24] RESOLVED: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:28] arturo: I've re-armed keyholder on cloudcumin1002 [15:41:31] *1001 [15:41:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T377164 [15:41:39] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Port 
defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10229866 (10Jelto) Importing `abuse/blocked_nets` to a host was a useful feature that we really miss on Gerrit right now. In the past, we used... [15:41:45] (03CR) 10Ayounsi: [C:03+2] Add reports for baremetal servers on legacy vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [15:42:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T377164 [15:42:06] volans: how do you know it was required to rearm? I could not find any error message ... [15:42:28] volans: thanks, I was finding our cookbooks were not working properly, and really having a hard time explaining why [15:42:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db2209 with weight 0 T377164', diff saved to https://phabricator.wikimedia.org/P69990 and previous config saved to /var/cache/conftool/dbconfig/20241015-154228-ladsgroup.json [15:42:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T367781)', diff saved to https://phabricator.wikimedia.org/P69991 and previous config saved to /var/cache/conftool/dbconfig/20241015-154256-arnaudb.json [15:42:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:43:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:43:16] arturo: FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on [15:43:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T367781)', diff saved to https://phabricator.wikimedia.org/P69992 and previous config saved to /var/cache/conftool/dbconfig/20241015-154318-arnaudb.json [15:43:29] they should work now [15:43:35] T377164: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T377164 
[15:43:37] (03Merged) 10jenkins-bot: Add reports for baremetal servers on legacy vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1080238 (owner: 10Ayounsi) [15:43:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:43:47] volans: thanks [15:44:17] I don't know why the alert didn't recover yet [15:45:00] 07sre-alert-triage, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): Alert in need of triage: PrometheusMysqldExporterFailed (instance dbstore1009:13350) - https://phabricator.wikimedia.org/T376977#10229886 (10Gehel) p:05Triage→03High [15:45:10] but ssh works fine [15:45:54] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:46:07] for vms, testing physical hosts too [15:46:41] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1080079 (https://phabricator.wikimedia.org/T377164) [15:46:43] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1080079 (https://phabricator.wikimedia.org/T377164) (owner: 10Gerrit maintenance bot) [15:46:44] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1080079 (https://phabricator.wikimedia.org/T377164) (owner: 10Gerrit maintenance bot) [15:46:46] yeah seems good [15:47:02] volans: I see some NRPE alerts too, socket in unknown state [15:47:11] is NRPE related to how the keyholder alert is triggered? 
[15:47:42] https://usercontent.irccloud-cdn.com/file/4cyNNubq/image.png [15:47:45] it recovered in the meanwhile [15:47:53] (keyholder) [15:48:04] ok [15:48:12] systemctl is clean [15:48:16] https://alerts.wikimedia.org/?q=instance%3Dcloudcumin1001%3A9100 too [15:48:17] !log Starting s3 codfw failover from db2205 to db2209 - T377164 [15:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:48:33] maybe just delayed recovery? [15:48:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s3 codfw as read-only for maintenance - T377164', diff saved to https://phabricator.wikimedia.org/P69993 and previous config saved to /var/cache/conftool/dbconfig/20241015-154834-ladsgroup.json [15:48:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T376905)', diff saved to https://phabricator.wikimedia.org/P69994 and previous config saved to /var/cache/conftool/dbconfig/20241015-154844-ladsgroup.json [15:48:57] T377164: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T377164 [15:50:17] jynus: the switchover is stuck on "Trying to invert replication direction" [15:50:19] :(((( [15:50:51] ? 
[15:50:51] orchestrator is happy (beside pt-heartbeat) [15:50:58] T377164 [15:52:01] ok, regarding replication topology it looks ok [15:52:09] at least db2209 [15:52:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db2209 to s3 primary and set section read-write T377164', diff saved to https://phabricator.wikimedia.org/P69995 and previous config saved to /var/cache/conftool/dbconfig/20241015-155240-ladsgroup.json [15:52:49] it was semi-sync I think [15:52:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T367781)', diff saved to https://phabricator.wikimedia.org/P69996 and previous config saved to /var/cache/conftool/dbconfig/20241015-155251-arnaudb.json [15:53:00] > result = master.execute("SET GLOBAL rpl_semi_sync_master_enabled = 0") [15:53:04] stuck at this [15:53:04] yeah, lately it has been producing issues [15:53:22] I think Arnaud hit one [15:53:27] let me check general log [15:53:32] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:53:44] https://www.irccloud.com/pastebin/XxHSne5E/ [15:54:03] we can fix stuff manually, let me see [15:54:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:54:58] (03PS2) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1080080 (https://phabricator.wikimedia.org/T377164) [15:55:04] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1080080 (https://phabricator.wikimedia.org/T377164) (owner: 10Gerrit maintenance bot) [15:55:06] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1080080 (https://phabricator.wikimedia.org/T377164) (owner: 10Gerrit maintenance bot) [15:55:06] Amir1: which server? 
[15:55:21] on the old one? [15:55:25] db2205 [15:55:26] yeah [15:55:37] ok, then if that is the one stuck, that's ok [15:55:43] I think you are going to rebuild it anyway [15:55:56] let me double check what's missing on the script and we can run it manually [15:55:56] ah yeah [15:56:00] I'm going to reclone it [15:56:15] so let's go over the script and see what could be missing [15:56:22] it wen't over the important stuff [15:56:24] *went [15:56:40] so it will be mostly semi sync, gtid zarcillo and so on [15:56:49] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4307/co" [puppet] - 10https://gerrit.wikimedia.org/r/1080270 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [15:56:49] maybe puppet [15:56:59] we do the puppet stuff ourselves [15:57:14] I can take care of zarcillo, there is a step for it to double check [15:57:19] we should do more work around the semisync [15:57:31] because that has happened twice already [15:57:37] * MichaelG_WMF is here if you want to get started ahead of time with the patch :) [15:57:52] and we should either workaround it or see how we can mitigate it [15:58:23] (03PS3) 10Ayounsi: re-image: ask user about migrating to per-rack vlan/IP [cookbooks] - 10https://gerrit.wikimedia.org/r/1080012 [15:58:58] MichaelG_WMF: hi! sure :) will you want to kick off a manual run to test it, or just wait for the hourly? 
[15:59:02] yeah, I do zarcillo [15:59:14] (03CR) 10RLazarus: [V:03+1 C:03+2] growthexperiments.pp: track dangling records for fr+eswiki hourly [puppet] - 10https://gerrit.wikimedia.org/r/1080270 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [15:59:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:59:18] so the step was "handle_old_master_semisync_replication" [15:59:30] FIRING: Device rebooted: Alert for device ps1-d6-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [15:59:37] please make sure heartbeat is working on the new one [15:59:43] I guess the hourly will be coming up pretty soon regardless :) [15:59:44] rzl: I already ran it manually previously, I'm not expecting any surprises there [15:59:52] Amir1: it should have been started by puppet [16:00:00] yeah, it works [16:00:04] MichaelG_WMF: okay cool, I'll just merge then, but I'll be around if you need any follow-up work [16:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1600). [16:00:05] MichaelG_WMF: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:22] and on top, orchestrator works based on pt-heartbeat, if it's broken, everything will be yellow [16:00:31] and the other thing is : reenable_gtid_on_old_master, update_zarcillo and update_events [16:00:51] I updated zarcillo [16:00:52] gtid is just stop; slave_pos; start on the old master [16:00:56] I did events too [16:01:05] on both servers? 
[16:01:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2205 T377164', diff saved to https://phabricator.wikimedia.org/P69997 and previous config saved to /var/cache/conftool/dbconfig/20241015-160106-ladsgroup.json [16:01:14] yeah [16:01:19] then that should be all [16:01:26] cool [16:01:28] the wmfmariadbpy is actually nice to read [16:01:35] and that is not my fault :-D [16:01:43] T377164: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T377164 [16:01:46] someone fixed it after I implemented the logic [16:01:53] Amir1: https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/main/wmfmariadbpy/cli_admin/switchover.py?ref_type=heads [16:02:00] it is actually easy to follow manually [16:02:35] and in this case, it has gone over the critical section, which is the topology changes [16:02:39] so all good [16:03:06] MichaelG_WMF: merged, and forced a puppet run on mwmaint2002 so it'll be there at 10 past, enjoy! let me know if you need anything [16:03:15] Amir1: as the old one failed and we are going to rebuild it, not a big deal [16:03:41] although feel free to ask me for help with a double check if you want after maintenance [16:03:52] rzl: Thank you! 
🙏 [16:03:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P69998 and previous config saved to /var/cache/conftool/dbconfig/20241015-160351-ladsgroup.json [16:04:04] (I always love if someone double checks my work just in case) [16:04:30] RESOLVED: Device rebooted: Device ps1-d6-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:05:21] yeah [16:05:31] (03CR) 10Herron: [C:03+1] thanos: Add a recording rule for PHP FPM workers [puppet] - 10https://gerrit.wikimedia.org/r/1079453 (owner: 10Alexandros Kosiaris) [16:06:46] (03CR) 10Herron: [C:03+1] grafana: Ensure grafana-loki service auto restarts after system updates [puppet] - 10https://gerrit.wikimedia.org/r/1080090 (https://phabricator.wikimedia.org/T377166) (owner: 10Andrea Denisse) [16:07:50] 14SRE-Sprint-Week-Sustainability-March2023, 06Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836#10230035 (10jhathaway) 05Open→03Invalid We have migrated postfix [16:07:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P69999 and previous config saved to /var/cache/conftool/dbconfig/20241015-160758-arnaudb.json [16:08:03] (03CR) 10Herron: [C:03+2] alertmanager-irc: improve ErrorBudgetBurn SLO alert text [puppet] - 10https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) (owner: 10Herron) [16:10:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T371742)', diff saved to https://phabricator.wikimedia.org/P70000 and previous config saved to /var/cache/conftool/dbconfig/20241015-161018-ladsgroup.json [16:10:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 
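(Note: the GTID step described above, "stop; slave_pos; start on the old master", can be sketched roughly as below. This is a minimal illustration, not the wmfmariadbpy implementation: `run_with_timeout` and `fake_execute` are hypothetical names, the per-statement timeout is an assumption added for illustration, and `execute` stands in for whatever callable actually sends a statement to the server.)

```python
import concurrent.futures
import time

# The three statements for re-enabling GTID on the old master,
# taken from the discussion in this log.
GTID_STATEMENTS = [
    "STOP SLAVE",
    "CHANGE MASTER TO MASTER_USE_GTID=Slave_Pos",
    "START SLAVE",
]


def run_with_timeout(execute, statements, timeout_s):
    """Run statements in order, giving up at the first one that hangs.

    `execute` is a caller-supplied callable taking one SQL string
    (hypothetical; no real database API is assumed here).
    Returns (completed_statements, stuck_statement_or_None).
    """
    completed = []
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        for stmt in statements:
            future = pool.submit(execute, stmt)
            try:
                future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                # Report the stuck statement instead of blocking forever.
                return completed, stmt
            completed.append(stmt)
        return completed, None
    finally:
        # Do not wait for a hung worker; it cannot be killed from here.
        pool.shutdown(wait=False)


def fake_execute(stmt):
    """Stand-in executor for demonstration: CHANGE MASTER 'hangs' briefly."""
    if stmt.startswith("CHANGE MASTER"):
        time.sleep(0.5)


if __name__ == "__main__":
    done, stuck = run_with_timeout(fake_execute, GTID_STATEMENTS, 0.05)
    print("completed:", done)
    print("stuck on:", stuck)
```

(The timeout only tells the operator which statement hung; it does not cancel the statement on the server, so a stuck replica still has to be handled by hand.)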
[16:14:25] FIRING: [4x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:11] stuck on "cumin2024@db2205.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_Pos; START SLAVE;" [16:15:31] something is wrong with db2205 (wrong more than usual) [16:16:55] yeah, it may be stuck [16:17:03] that's the issue that happened before [16:17:24] as long as it is only that, we may be ok [16:17:48] arn*udb had to eventually kill it, I would start by depooling it [16:17:51] the exec thread is not moving forward [16:18:07] yeah, it may be stuck [16:18:09] yeah, I'm going to reclone it [16:18:18] if you are going to destroy the data, just kill it [16:18:25] it is the same issue [16:18:25] yup [16:18:39] and better do it now before it affects its parent [16:18:50] this was what caused the issue for the switchover [16:18:56] the replication control stopped responding [16:18:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P70001 and previous config saved to /var/cache/conftool/dbconfig/20241015-161858-ladsgroup.json [16:19:12] depool, downtime, kill [16:19:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P70002 and previous config saved to /var/cache/conftool/dbconfig/20241015-161934-ladsgroup.json [16:19:49] this is s3 [16:19:59] so it may be related to the amount of objects [16:20:08] as the last time it also happened on s3 [16:20:32] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652 [16:21:40] !log ladsgroup@cumin1002 START - 
Cookbook sre.mysql.clone of db2194.codfw.wmnet onto db2205.codfw.wmnet [16:21:52] Amir1: https://phabricator.wikimedia.org/T374425 [16:22:01] same server [16:22:18] although it was recloned [16:22:29] https://bash.toolforge.org/quip/rDHSeYMB6FQ6iqKiHR7m [16:22:30] so maybe it is hw? [16:23:01] yeah, once done I'm just going to check hw [16:23:05] what is your ticket? [16:23:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P70003 and previous config saved to /var/cache/conftool/dbconfig/20241015-162305-arnaudb.json [16:23:10] I want to link those [16:23:17] T375652 [16:23:31] and switchover subticket: T377164 [16:23:35] T377164: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T377164 [16:23:38] I'm so happy this is not the master anymore [16:23:52] (03PS1) 10RLazarus: deployment_server: Skip prometheus in mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1080321 (https://phabricator.wikimedia.org/T376714) [16:24:02] given it was recloned [16:24:09] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Ensure grafana-loki service auto restarts after system updates [puppet] - 10https://gerrit.wikimedia.org/r/1080090 (https://phabricator.wikimedia.org/T377166) (owner: 10Andrea Denisse) [16:24:21] it is either s3 peculiarities (large amount of objects) or hw, would try hw first [16:25:21] we reduced number of objects in s3 a lot [16:25:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P70004 and previous config saved to /var/cache/conftool/dbconfig/20241015-162525-ladsgroup.json [16:25:26] I dropped a lot of random tables [16:26:05] I guess, but still in the order of dozens or hundreds of thousand tables [16:26:10] siiiiiiiigh, the stop slave on the cookbook is stuck [16:26:22] yeah, that is expected [16:26:48] that is why I setup when I can a timeout there, it 
usually is instant... except when it is not [16:26:53] good call catching it [16:27:43] leaving you on your own to handle it [16:27:55] I don't think I am useful anymore [16:27:59] now, I need to do it manually [16:28:03] thank you <3 [16:28:04] he he [16:29:19] btw, that stop replica is there precisely to catch those outliers [16:29:36] you want to know if it is stuck rather than restart blindly [16:30:16] (but before it happens to you, one would say "useless/not needed") [16:30:30] (03CR) 10Scott French: [C:03+1] deployment_server: Skip prometheus in mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1080321 (https://phabricator.wikimedia.org/T376714) (owner: 10RLazarus) [16:34:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T376905)', diff saved to https://phabricator.wikimedia.org/P70005 and previous config saved to /var/cache/conftool/dbconfig/20241015-163404-ladsgroup.json [16:34:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:34:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [16:34:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70006 and previous config saved to /var/cache/conftool/dbconfig/20241015-163419-ladsgroup.json [16:38:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T367781)', diff saved to https://phabricator.wikimedia.org/P70007 and previous config saved to /var/cache/conftool/dbconfig/20241015-163812-arnaudb.json [16:38:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [16:38:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance 
[16:38:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T367781)', diff saved to https://phabricator.wikimedia.org/P70008 and previous config saved to /var/cache/conftool/dbconfig/20241015-163834-arnaudb.json [16:38:45] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:40:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P70009 and previous config saved to /var/cache/conftool/dbconfig/20241015-164032-ladsgroup.json [16:40:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T367781)', diff saved to https://phabricator.wikimedia.org/P70010 and previous config saved to /var/cache/conftool/dbconfig/20241015-164050-arnaudb.json [16:41:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70011 and previous config saved to /var/cache/conftool/dbconfig/20241015-164127-ladsgroup.json [16:42:50] (03CR) 10RLazarus: [C:03+2] deployment_server: Skip prometheus in mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1080321 (https://phabricator.wikimedia.org/T376714) (owner: 10RLazarus) [16:45:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [16:54:25] FIRING: [4x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T371742)', diff saved to https://phabricator.wikimedia.org/P70012 and previous config saved to /var/cache/conftool/dbconfig/20241015-165539-ladsgroup.json [16:55:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [16:55:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [16:55:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P70013 and previous config saved to /var/cache/conftool/dbconfig/20241015-165556-arnaudb.json [16:56:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:56:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T371742)', diff saved to https://phabricator.wikimedia.org/P70014 and previous config saved to /var/cache/conftool/dbconfig/20241015-165608-ladsgroup.json [16:56:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P70015 and previous config saved to /var/cache/conftool/dbconfig/20241015-165634-ladsgroup.json [16:59:25] FIRING: [4x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1700). [17:00:39] here, will get started on this shortly [17:01:45] (03CR) 10Scott French: [C:03+2] hieradata: convert remaining mw_releases entries [puppet] - 10https://gerrit.wikimedia.org/r/1077482 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [17:03:21] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10230424 (10BTullis) [17:08:07] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10230435 (10nisrael) Apologies, I was out of office last week. @jhathaway and @Dzahn here is the .eml file! 
{F57617255} [17:08:16] (03PS8) 10Scott French: types: remove older Mediawiki_deployment variant [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) [17:08:35] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [17:10:39] !log swfrench@deploy2002 Started scap sync-world: Testing scap after mediawiki-deployments.yaml format change - T370934 [17:11:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P70016 and previous config saved to /var/cache/conftool/dbconfig/20241015-171103-arnaudb.json [17:11:12] T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions - https://phabricator.wikimedia.org/T370934 [17:11:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P70017 and previous config saved to /var/cache/conftool/dbconfig/20241015-171141-ladsgroup.json [17:13:27] !log swfrench@deploy2002 Finished scap sync-world: Testing scap after mediawiki-deployments.yaml format change - T370934 (duration: 02m 47s) [17:14:35] (03PS1) 10RLazarus: mw-script: Die with a clearer error when $RELEASE_NAME is "prometheus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080331 (https://phabricator.wikimedia.org/T376714) [17:16:34] I'm finished with the mediawiki infra window [17:18:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10230470 (10Papaul) [17:23:58] (03PS1) 10DCausse: cirrus: cleanup removed label_count field on next re-index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080332 (https://phabricator.wikimedia.org/T377226) [17:26:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2188 (T367781)', diff saved to https://phabricator.wikimedia.org/P70018 and previous config saved to /var/cache/conftool/dbconfig/20241015-172610-arnaudb.json [17:26:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [17:26:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [17:26:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:26:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [17:26:48] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T377059#10230507 (10VRiley-WMF) a:03VRiley-WMF [17:26:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T376905)', diff saved to https://phabricator.wikimedia.org/P70019 and previous config saved to /var/cache/conftool/dbconfig/20241015-172648-ladsgroup.json [17:26:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [17:26:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [17:26:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T367781)', diff saved to https://phabricator.wikimedia.org/P70020 and previous config saved to /var/cache/conftool/dbconfig/20241015-172657-arnaudb.json [17:27:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [17:27:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T376905)', diff saved to https://phabricator.wikimedia.org/P70021 and previous config saved to /var/cache/conftool/dbconfig/20241015-172714-ladsgroup.json 
[17:27:41] (03CR) 10Jcrespo: [C:03+1] "Thank you very much, Andrea!" [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [17:27:46] (03CR) 10Jcrespo: [C:03+2] check footer legal complience: Add support for relative URLs [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [17:27:51] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T377059#10230515 (10VRiley-WMF) Opened a ticket with Dell to replace the part. Will get the part shipped to the eqiad. (Dell ticket 199327299 ) [17:29:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367781)', diff saved to https://phabricator.wikimedia.org/P70022 and previous config saved to /var/cache/conftool/dbconfig/20241015-172912-arnaudb.json [17:30:06] (03PS7) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [17:34:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T376905)', diff saved to https://phabricator.wikimedia.org/P70023 and previous config saved to /var/cache/conftool/dbconfig/20241015-173409-ladsgroup.json [17:37:27] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10230551 (10Papaul) [17:40:11] (03PS1) 10Scott French: hiddenparma: add fake keyholder secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1080333 (https://phabricator.wikimedia.org/T371782) [17:44:09] (03CR) 10Scott French: [V:03+2 C:03+2] hiddenparma: add fake keyholder secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1080333 (https://phabricator.wikimedia.org/T371782) (owner: 10Scott French) [17:44:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to 
https://phabricator.wikimedia.org/P70024 and previous config saved to /var/cache/conftool/dbconfig/20241015-174419-arnaudb.json [17:48:06] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [17:49:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P70025 and previous config saved to /var/cache/conftool/dbconfig/20241015-174916-ladsgroup.json [17:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10230643 (10phaultfinder) [17:59:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P70026 and previous config saved to /var/cache/conftool/dbconfig/20241015-175926-arnaudb.json [18:00:04] jeena and andre: MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T1800). Please do the needful. [18:00:14] o/ [18:00:42] (03CR) 10Scott French: [C:03+1] "That's a lot more straightforward to do than I expected!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080331 (https://phabricator.wikimedia.org/T376714) (owner: 10RLazarus) [18:04:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P70027 and previous config saved to /var/cache/conftool/dbconfig/20241015-180423-ladsgroup.json [18:14:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367781)', diff saved to https://phabricator.wikimedia.org/P70028 and previous config saved to /var/cache/conftool/dbconfig/20241015-181433-arnaudb.json [18:14:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [18:14:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [18:14:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T367781)', diff saved to https://phabricator.wikimedia.org/P70029 and previous config saved to /var/cache/conftool/dbconfig/20241015-181455-arnaudb.json [18:15:05] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:17:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T367781)', diff saved to https://phabricator.wikimedia.org/P70030 and previous config saved to /var/cache/conftool/dbconfig/20241015-181711-arnaudb.json [18:17:27] (03CR) 10RLazarus: [C:03+2] mw-script: Die with a clearer error when $RELEASE_NAME is "prometheus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080331 (https://phabricator.wikimedia.org/T376714) (owner: 10RLazarus) [18:18:27] (03Merged) 10jenkins-bot: mw-script: Die with a clearer error when $RELEASE_NAME is "prometheus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080331 (https://phabricator.wikimedia.org/T376714) (owner: 10RLazarus) [18:19:31] !log ladsgroup@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1236 (T376905)', diff saved to https://phabricator.wikimedia.org/P70031 and previous config saved to /var/cache/conftool/dbconfig/20241015-181930-ladsgroup.json [18:26:23] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254 (10Papaul) 03NEW [18:28:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T371742)', diff saved to https://phabricator.wikimedia.org/P70032 and previous config saved to /var/cache/conftool/dbconfig/20241015-182800-ladsgroup.json [18:28:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:31:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:32:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P70033 and previous config saved to /var/cache/conftool/dbconfig/20241015-183218-arnaudb.json [18:34:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2081-3 to codfw - jhancock@cumin2002" [18:34:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2081-3 to codfw - jhancock@cumin2002" [18:34:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:35:39] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2081 [18:35:41] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2082 [18:35:42] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2083 [18:35:49] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2083 [18:35:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2081 [18:35:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2082 [18:36:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:36:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:36:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:37:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:37:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:37:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:42:11] (03PS1) 10Simon04: Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 [18:42:23] (03CR) 10CI reject: [V:04-1] Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (owner: 10Simon04) [18:43:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P70034 and previous config saved to /var/cache/conftool/dbconfig/20241015-184307-ladsgroup.json [18:46:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:25] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P70035 and previous config saved to /var/cache/conftool/dbconfig/20241015-184724-arnaudb.json [18:48:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:49:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:50:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:50:18] (03PS2) 10Simon04: Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) [18:50:29] (03CR) 10CI reject: [V:04-1] Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [18:51:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:52:28] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254#10230870 (10Papaul) [18:52:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:53:10] (03CR) 10Dzahn: [C:03+1] "looks good. I think it also avoids a duplicate declaration by changing the resource name?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1079395 (owner: 10Muehlenhoff) [18:54:02] (03PS3) 10Simon04: Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) [18:55:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [18:55:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [18:55:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [18:55:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10230871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye [18:55:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10230872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [18:55:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10230873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [18:55:56] (03CR) 10CI reject: [V:04-1] Revert "wwwportals: clean up query string on www.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [18:58:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P70036 and previous config saved to 
/var/cache/conftool/dbconfig/20241015-185814-ladsgroup.json [19:02:27] (03CR) 10Dzahn: [C:03+2] "being bold and just deploying it. as you say, can be reverted if needed." [puppet] - 10https://gerrit.wikimedia.org/r/1080313 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:02:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T367781)', diff saved to https://phabricator.wikimedia.org/P70037 and previous config saved to /var/cache/conftool/dbconfig/20241015-190231-arnaudb.json [19:02:52] (03CR) 10Herron: [C:03+2] opentelemetry::collector: set default port and update template [puppet] - 10https://gerrit.wikimedia.org/r/1076006 (https://phabricator.wikimedia.org/T376179) (owner: 10Herron) [19:03:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:04:55] puppet-merge conflict. waiting for lock. [19:05:03] also feel free to "multiple" [19:05:09] mutante: thanks just did [19:05:27] thanks herron [19:05:49] (03CR) 10Herron: [C:03+2] thanos-query: set OTEL_SERVICE_NAME env variable [puppet] - 10https://gerrit.wikimedia.org/r/1077068 (https://phabricator.wikimedia.org/T376179) (owner: 10Herron) [19:06:04] FYI, I will now start promoting group0 wikis to 1.43.0-wmf.27 [19:07:09] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080363 (https://phabricator.wikimedia.org/T375658) [19:07:11] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080363 (https://phabricator.wikimedia.org/T375658) (owner: 10TrainBranchBot) [19:07:51] (03Abandoned) 10Dzahn: tlsproxy::envoy: do not use ferm::service if firewall_src_sets is set [puppet] - 10https://gerrit.wikimedia.org/r/1076889 (owner: 10Dzahn) [19:08:14] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080363 (https://phabricator.wikimedia.org/T375658) (owner: 
10TrainBranchBot) [19:10:08] (03PS2) 10Scott French: echostore: remove per-env certs.kask override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079569 (https://phabricator.wikimedia.org/T376766) [19:13:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T371742)', diff saved to https://phabricator.wikimedia.org/P70038 and previous config saved to /var/cache/conftool/dbconfig/20241015-191322-ladsgroup.json [19:13:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:13:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:13:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T371742)', diff saved to https://phabricator.wikimedia.org/P70039 and previous config saved to /var/cache/conftool/dbconfig/20241015-191345-ladsgroup.json [19:13:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:14:47] (03CR) 10Xcollazo: [C:03+1] dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [19:15:01] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.27 refs T375658 [19:17:07] T375658: 1.43.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T375658 [19:17:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... 
[19:18:01] Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:22:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:25:15] (03PS5) 10RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [19:27:49] (03PS6) 10RLazarus: deployment_server: Read Helm secrets in `mwscript-cleanup` [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [19:29:30] FIRING: Device rebooted: Alert for device ps1-c3-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:29:55] (03CR) 10RLazarus: "PTAL -- sorry for the late-breaking rewrite, but as discussed I think this approach gets us out of a lot of worrying wrt your extremely we" [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [19:32:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [19:34:30] RESOLVED: Device rebooted: Device ps1-c3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - 
https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:34:32] (03CR) 10RLazarus: [C:03+1] echostore: remove per-env certs.kask override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079569 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [19:47:46] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1074468/4309/releases1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [19:51:49] (03PS7) 10Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) [19:52:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [19:52:53] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-10-08-175830 to 2024-10-15-192817 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080367 (https://phabricator.wikimedia.org/T375922) [19:52:54] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-10-08-175510 to 2024-10-10-202633 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080368 (https://phabricator.wikimedia.org/T375922) [19:53:16] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-10-08-175830 to 2024-10-15-192817 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080367 (https://phabricator.wikimedia.org/T375922) (owner: 10Jforrester) [19:54:18] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-10-08-175830 to 2024-10-15-192817 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080367 (https://phabricator.wikimedia.org/T375922) (owner: 10Jforrester) [19:54:30] FIRING: Device rebooted: 
Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:55:26] (03PS9) 10Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) [19:56:09] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:56:39] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:57:21] (03CR) 10Dzahn: [V:03+1 C:03+2] "on contint2002: Notice: /Stage[main]/Apt/File[/etc/apt/sources.list.d/jenkins-thirdparty-ci.list]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [19:58:42] (03CR) 10Dzahn: [C:03+2] ci: fix git mirror not fetching branches [puppet] - 10https://gerrit.wikimedia.org/r/1079472 (https://phabricator.wikimedia.org/T376981) (owner: 10Hashar) [19:59:26] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:59:30] RESOLVED: Device rebooted: Device ps1-b2-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241015T2000). [20:00:05] kimberly_sarabia, Ammar, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] Here [20:00:24] My two config patches are not dependent on each other at all and can be deployed in any order, or together [20:00:28] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:00:44] Hello. I need help backporting. Thanks! 
[20:00:48] o/ [20:00:51] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:00:52] i can deploy [20:01:04] cjming: Thank you [20:01:28] (03PS5) 10Kimberly Sarabia: Remove legacy UI actions tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) [20:01:36] (03PS1) 10Bartosz Dziewoński: SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1080369 (https://phabricator.wikimedia.org/T45646) [20:01:36] (03PS1) 10Jforrester: wikifunctions: Configure the main API access route for Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080370 [20:01:38] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:01:47] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-10-08-175510 to 2024-10-10-202633 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080368 (https://phabricator.wikimedia.org/T375922) (owner: 10Jforrester) [20:01:51] bah, i'm late again, i hope i can still add a backport [20:02:01] i'll go in order [20:02:02] (03CR) 10Scott French: "Looks good! Agreed that, at the expense of relying on a helm implementation detail, this is much easier to get right than shelling out to " [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [20:02:10] MatmaRex: np - can you add to cal? 
[20:02:17] yeah, doing [20:02:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: 10Kimberly Sarabia) [20:02:51] i really like bd808's new backport tool, but i feel like it judges me when it doesn't let me add patches to a window that is already in progress ;) [20:02:53] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-10-08-175510 to 2024-10-10-202633 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080368 (https://phabricator.wikimedia.org/T375922) (owner: 10Jforrester) [20:02:57] MatmaRex: :-D [20:03:02] (It does.) [20:03:19] (03Merged) 10jenkins-bot: Remove legacy UI actions tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: 10Kimberly Sarabia) [20:03:21] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:03:27] oh. you can hack it? [20:03:40] Sorry, I meant, it does judge you. [20:03:48] lol [20:03:48] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1077504|Remove legacy UI actions tracking (T376065)]] [20:03:50] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:04:03] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:04:19] T376065: Delete redundant mobile- and desktopwebuiactions event in WikimediaEvents - https://phabricator.wikimedia.org/T376065 [20:04:39] oh :D [20:04:57] (03CR) 10Scott French: "Thanks, Reuven!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1079569 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [20:05:03] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:05:22] (03CR) 10Scott French: [C:03+2] echostore: remove per-env certs.kask override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079569 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [20:05:34] added to the calendar, the patch is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1080369 [20:06:04] !log cjming@deploy2002 ksarabia, cjming: Backport for [[gerrit:1077504|Remove legacy UI actions tracking (T376065)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:04] (i backported it to wmf.27 earlier today, then i realized it's needed in wmf.26 too) [20:06:08] (03CR) 10Dzahn: [C:03+2] "thanks! I spent some time wondering why I can't get those to show up as running." [puppet] - 10https://gerrit.wikimedia.org/r/1079382 (owner: 10Krinkle) [20:06:10] kimberly_sarabia: up on test servers if you'd like to verify [20:06:22] MatmaRex: sounds good [20:06:25] (03Merged) 10jenkins-bot: echostore: remove per-env certs.kask override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079569 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [20:06:28] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:06:36] (03PS8) 10Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) [20:07:03] Ammar: are you around? 
[20:07:11] (03CR) 10Jforrester: [C:03+2] wikifunctions: Configure the main API access route for Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080370 (owner: 10Jforrester) [20:07:15] cjming Yes [20:07:30] cool - i'll do yours next [20:07:33] I did some last-minute testing of the patch I had scheduled for this deployment window and caught (and fixed) a mistake [20:07:37] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:08:08] Pppery: nice - there's still a little time before i get to your patches [20:08:26] kimberly_sarabia: lmk if/when i can sync [20:08:36] (03Merged) 10jenkins-bot: wikifunctions: Configure the main API access route for Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080370 (owner: 10Jforrester) [20:08:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2081.codfw.wmnet with OS bullseye [20:09:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2082.codfw.wmnet with OS bullseye [20:09:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231221 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye execute... [20:09:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye execute... [20:09:31] MatmaRex: is it reasonable to expect your backport to take 20+ minutes to merge? 
[20:09:36] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:09:42] thank cjming for your service <3 [20:09:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:10:01] kindrobot: likewise! [20:10:06] cjming: sounds good, thanks! [20:10:16] All backports of code (rather than config) take 20+ minutes to merge, don't they? [20:10:30] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:10:36] cjming: yes, please give it a +2 early if you can [20:10:40] i think so - it's been a while since i've done one - just wondering if they got faster in the meantime [20:10:49] MatmaRex: will do [20:11:13] kimberly_sarabia: was that confirmation for me to sync. your patch? [20:11:15] the one this morning apparently took 38 minutes… wtf https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1080279 [20:11:17] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:11:27] gah! [20:11:29] i got used to 20, but 38 seems unusual [20:11:33] merging now [20:11:34] cjming: yes! 
[20:11:42] !log cjming@deploy2002 ksarabia, cjming: Continuing with sync [20:11:45] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:11:59] (03CR) 10Clare Ming: [C:03+2] SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1080369 (https://phabricator.wikimedia.org/T45646) (owner: 10Bartosz Dziewoński) [20:12:25] (03PS5) 10Ammarpad: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 [20:12:27] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:15:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [20:15:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye execute... [20:16:17] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077504|Remove legacy UI actions tracking (T376065)]] (duration: 12m 28s) [20:16:35] (03CR) 10Dzahn: [C:03+1] "I don't really have the context to fully understand this ticket and haven't tested the code. But nevertheless I think you should just mer" [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [20:17:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 (owner: 10Ammarpad) [20:17:23] T376065: Delete redundant mobile- and desktopwebuiactions event in WikimediaEvents - https://phabricator.wikimedia.org/T376065 [20:17:24] kimberly_sarabia: should be live! [20:17:42] thanks! [20:17:49] yw! 
[20:18:40] (03Merged) 10jenkins-bot: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 (owner: 10Ammarpad) [20:19:29] (03PS7) 10RLazarus: deployment_server: Read Helm secrets in `mwscript-cleanup` [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [20:19:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:57] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1080023|contactpages: Move stewards contactpage to MetaContactPages.php]] [20:22:31] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10231262 (10RobH) [20:23:10] !log cjming@deploy2002 ammarpad, cjming: Backport for [[gerrit:1080023|contactpages: Move stewards contactpage to MetaContactPages.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:18] Ammar: if your patch is testable, it's up on mwdebug - lmk if/when to sync [20:24:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:24:42] (03Abandoned) 10Jforrester: wikifunctions: Add mw-api-int-async-ro route for Wikidata fetches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079326 (owner: 10Jforrester) [20:26:12] (03CR) 10RLazarus: deployment_server: Read Helm secrets in `mwscript-cleanup` (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [20:27:00] cjming: I tested, it looks good to me. You can proceed. [20:27:04] thank you [20:27:10] np! syncing [20:27:13] !log cjming@deploy2002 ammarpad, cjming: Continuing with sync [20:27:34] (03PS6) 10Pppery: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [20:28:42] (03CR) 10Scott French: [C:03+1] deployment_server: Read Helm secrets in `mwscript-cleanup` (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [20:31:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [20:31:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231305 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [20:31:53] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080023|contactpages: Move stewards contactpage to MetaContactPages.php]] (duration: 10m 56s) [20:32:01] (03PS1) 10Cwhite: logstash: add test suite basis for containerd logs [puppet] - 10https://gerrit.wikimedia.org/r/1080373 (https://phabricator.wikimedia.org/T377132) [20:32:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [20:32:25] Ammar: your patch should be live :) [20:32:43] Pppery: onto yours [20:32:48] ok [20:32:54] (03Merged) 10jenkins-bot: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 
(https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [20:33:14] i'm going to deploy them separately since they touch the same file - it's probably ok but i'm not sure what will happen [20:33:21] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1075957|Missing.php: Improve detection of interwikis in certain cases (T363538)]] [20:33:38] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [20:34:03] (03PS9) 10Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) [20:35:07] cjming: Great, thank you [20:35:31] yw! [20:35:38] !log cjming@deploy2002 cjming, pppery: Backport for [[gerrit:1075957|Missing.php: Improve detection of interwikis in certain cases (T363538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:36:01] Pppery: 1st patch on test servers if verifiable [20:36:07] looking [20:36:10] it is verifiable [20:37:05] (03CR) 10RLazarus: [C:03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [20:37:19] Works [20:37:34] cool ! 
syncing [20:37:38] !log cjming@deploy2002 cjming, pppery: Continuing with sync [20:39:01] (03Merged) 10jenkins-bot: SkinComponentCopyright: Fix message existence check for history-copyright [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1080369 (https://phabricator.wikimedia.org/T45646) (owner: 10Bartosz Dziewoński) [20:39:40] So matmarex's patch took 25 minutes as about expected [20:42:12] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075957|Missing.php: Improve detection of interwikis in certain cases (T363538)]] (duration: 08m 50s) [20:42:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [20:42:31] T363538: Deal with Manual of Style pseudo-namespaces conflicting with Mooré Wikipedia - https://phabricator.wikimedia.org/T363538 [20:42:41] i wonder what happened this morning that made it so slow. [20:43:03] (03Merged) 10jenkins-bot: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [20:43:16] (03CR) 10Cwhite: [C:03+2] logstash: add test suite basis for containerd logs [puppet] - 10https://gerrit.wikimedia.org/r/1080373 (https://phabricator.wikimedia.org/T377132) (owner: 10Cwhite) [20:43:56] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1079054|Redirect all namespace-in-Wikipedia cases to Wikipedia (T376923)]] [20:44:11] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [20:44:57] (03CR) 10JHathaway: "Thanks, I think that is sound advice, I'll go ahead and merge, easy enough to revert." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [20:46:10] !log cjming@deploy2002 cjming, pppery: Backport for [[gerrit:1079054|Redirect all namespace-in-Wikipedia cases to Wikipedia (T376923)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:28] Pppery: 2nd patch on mwdebug [20:46:32] looking [20:46:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T371742)', diff saved to https://phabricator.wikimedia.org/P70040 and previous config saved to /var/cache/conftool/dbconfig/20241015-204642-ladsgroup.json [20:47:03] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:48:54] Still looking, have to check every domain added which is taking a while [20:49:42] FIRING: Device rebooted: Alert for device ps1-c4-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [20:50:44] Pppery: np - take your time [20:51:41] Seems to work [20:51:54] cool - going live! [20:51:56] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1079453 (owner: 10Alexandros Kosiaris) [20:51:58] !log cjming@deploy2002 cjming, pppery: Continuing with sync [20:52:13] thanks [20:54:42] RESOLVED: Device rebooted: Device ps1-c4-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [20:55:40] (03CR) 10Dzahn: "It seems this has been resolved by the alternative approach linked below which has been merged meanwhile." 
[puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [20:56:02] (03CR) 10RLazarus: [C:03+1] types: remove older Mediawiki_deployment variant [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [20:56:09] (03CR) 10Dzahn: "Yea, it's already redirecting sco.wiktionary.org/wiki" [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [20:56:30] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079054|Redirect all namespace-in-Wikipedia cases to Wikipedia (T376923)]] (duration: 12m 33s) [20:57:02] (03CR) 10Pppery: "Yep. This can be abandoned as I decided to do it at a lower layer for more flexibility." [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [20:57:08] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1080369|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] [20:57:10] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [20:57:36] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2194.codfw.wmnet onto db2205.codfw.wmnet [20:57:36] T45646: "MediaWiki:Copyright" message allows raw HTML - https://phabricator.wikimedia.org/T45646 [20:57:55] MatmaRex: I think since your patch got merged before the last scap backport, it actually got deployed already with the last patch [20:58:01] i think [20:58:08] oh. oops. 
let me test [20:58:20] I thought scap was supposed to warn when that happened [20:58:20] anyway just to be sure, i ran scap backport on your changeset [20:58:33] it did - i went ahead with it [20:58:44] Pppery: your 2nd patch should be live btw [20:58:55] THanks [20:59:10] cjming: yeah, i see the fixed behavior, even without mwdebug. everything seems to work fine [20:59:19] !log cjming@deploy2002 cjming, matmarex: Backport for [[gerrit:1080369|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:59:22] cool - gtk [20:59:25] FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-fixLinkRecommendationData-dryrun-eswiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:25] !log cjming@deploy2002 cjming, matmarex: Continuing with sync [20:59:40] then i'm not sure if ^^ will do anything but oh well [21:01:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P70041 and previous config saved to /var/cache/conftool/dbconfig/20241015-210149-ladsgroup.json [21:03:15] (03CR) 10Dzahn: [C:03+2] "The only difference is now the MOTD and the values for the throttling: https://puppet-compiler.wmflabs.org/output/1063893/4310/gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:03:59] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080369|SkinComponentCopyright: Fix message existence check for history-copyright (T45646)]] (duration: 06m 51s) [21:04:03] (03PS6) 10Pppery: Remove als redirects [puppet] - 10https://gerrit.wikimedia.org/r/1079056 (https://phabricator.wikimedia.org/T376923) [21:04:06] (03CR) 10Scott 
French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [21:04:11] (03CR) 10Scott French: [C:03+2] types: remove older Mediawiki_deployment variant [puppet] - 10https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [21:04:16] (03CR) 10JHathaway: [C:03+2] vrts_aliases: query database for valid addresses [puppet] - 10https://gerrit.wikimedia.org/r/1074200 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [21:04:20] T45646: "MediaWiki:Copyright" message allows raw HTML - https://phabricator.wikimedia.org/T45646 [21:04:29] !log end of UTC late backport window [21:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:35] (03CR) 10JHathaway: [C:03+2] vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway) [21:05:03] swfrench-wmf: good to merge your change? [21:05:15] jhathaway: please and thank you :) [21:05:24] great, done [21:06:37] (03CR) 10Pppery: "Note that this and the corresponding config patch can be deployed in either order. If this one is done first then als.wikibooks.org will p" [puppet] - 10https://gerrit.wikimedia.org/r/1079056 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [21:06:49] (03PS3) 10Ladsgroup: dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [21:06:59] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [21:08:46] (03CR) 10Dzahn: "ACK! 
thanks, I +1ed that one" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:09:13] (03CR) 10Dzahn: [C:04-1] "also abandoned https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076889 which seemed like a near duplicate of that now" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:09:44] (03PS2) 10Dzahn: gerrit: delete temp gerrit setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074488 [21:09:46] (03PS4) 10Ladsgroup: dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [21:09:48] (03CR) 10Ladsgroup: [C:03+2] dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [21:09:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] dumps: Mark globalblocks dir and script as absent [puppet] - 10https://gerrit.wikimedia.org/r/1080272 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [21:11:40] (03PS3) 10Dzahn: gerrit: delete temp gerrit setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074488 [21:15:21] (03PS1) 10Dzahn: gerrit: remove Hiera keys on host gerrit2003 that are applied by role [puppet] - 10https://gerrit.wikimedia.org/r/1080379 (https://phabricator.wikimedia.org/T372804) [21:15:32] (03CR) 10CI reject: [V:04-1] gerrit: remove Hiera keys on host gerrit2003 that are applied by role [puppet] - 10https://gerrit.wikimedia.org/r/1080379 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:16:21] (03PS2) 10Dzahn: gerrit: remove Hiera keys on host gerrit2003 that are applied by role [puppet] - 10https://gerrit.wikimedia.org/r/1080379 (https://phabricator.wikimedia.org/T372804) [21:16:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to 
https://phabricator.wikimedia.org/P70042 and previous config saved to /var/cache/conftool/dbconfig/20241015-211656-ladsgroup.json [21:18:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P70043 and previous config saved to /var/cache/conftool/dbconfig/20241015-211800-ladsgroup.json [21:23:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "no change - https://puppet-compiler.wmflabs.org/output/1080379/4311/gerrit2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1080379 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:24:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:24:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:24:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P70044 and previous config saved to /var/cache/conftool/dbconfig/20241015-212431-ladsgroup.json [21:25:02] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:25:03] (03CR) 10Dzahn: "We might have waited long enough now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1073740 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [21:25:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2205.codfw.wmnet with reason: Sad [21:25:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2205.codfw.wmnet with reason: Sad [21:26:32] (03PS5) 10Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) [21:27:10] (03CR) 10CI reject: [V:04-1] Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [21:27:40] (03PS6) 10Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) [21:28:22] (03CR) 10CI reject: [V:04-1] Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [21:28:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P70045 and previous config saved to /var/cache/conftool/dbconfig/20241015-212835-ladsgroup.json [21:28:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [21:29:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [21:30:12] (03PS7) 10Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) [21:32:03] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T371742)', diff saved to https://phabricator.wikimedia.org/P70046 and previous config saved to /var/cache/conftool/dbconfig/20241015-213203-ladsgroup.json [21:32:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [21:32:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [21:32:23] (03CR) 10Dzahn: [C:03+2] "https://codesearch.wmcloud.org/_health/ now shows only "up" and working ports" [puppet] - 10https://gerrit.wikimedia.org/r/1079382 (owner: 10Krinkle) [21:32:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:32:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T371742)', diff saved to https://phabricator.wikimedia.org/P70047 and previous config saved to /var/cache/conftool/dbconfig/20241015-213227-ladsgroup.json [21:33:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P70048 and previous config saved to /var/cache/conftool/dbconfig/20241015-213305-ladsgroup.json [21:34:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [21:34:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [21:34:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T376905)', diff saved to https://phabricator.wikimedia.org/P70049 and previous config saved to /var/cache/conftool/dbconfig/20241015-213423-ladsgroup.json [21:37:38] (03CR) 10Scott French: "Thanks, Hugh!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) (owner: 10Hnowlan) [21:43:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P70050 and previous config saved to /var/cache/conftool/dbconfig/20241015-214342-ladsgroup.json [21:43:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T376905)', diff saved to https://phabricator.wikimedia.org/P70051 and previous config saved to /var/cache/conftool/dbconfig/20241015-214350-ladsgroup.json [21:48:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P70052 and previous config saved to /var/cache/conftool/dbconfig/20241015-214811-ladsgroup.json [21:50:34] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10231682 (10Jclark-ctr) @fnegri sorry for lack of updates. I have been back and forth with firmware updates multiple times with dell it is in progress. [21:51:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [21:51:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye execute... 
[21:52:14] 06SRE, 10Wikimedia-Portals, 13Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10231695 (10Jclark-ctr) [21:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10231712 (10phaultfinder) [21:57:42] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, 13Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10231718 (10Pppery) [21:58:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P70053 and previous config saved to /var/cache/conftool/dbconfig/20241015-215849-ladsgroup.json [21:58:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P70054 and previous config saved to /var/cache/conftool/dbconfig/20241015-215857-ladsgroup.json [22:03:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P70055 and previous config saved to /var/cache/conftool/dbconfig/20241015-220316-ladsgroup.json [22:10:55] (03CR) 10JHathaway: redfish: add UEFI functions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [22:12:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10231748 (10Papaul) @MatthewVernon ms-be2082 is failing with the error below: ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER:... 
[22:13:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P70056 and previous config saved to /var/cache/conftool/dbconfig/20241015-221356-ladsgroup.json [22:14:01] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:14:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P70057 and previous config saved to /var/cache/conftool/dbconfig/20241015-221404-ladsgroup.json [22:22:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [22:22:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [22:25:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [22:25:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [22:29:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T376905)', diff saved to https://phabricator.wikimedia.org/P70058 and previous config saved to /var/cache/conftool/dbconfig/20241015-222911-ladsgroup.json [22:29:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [22:29:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [22:29:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T376905)', diff saved to https://phabricator.wikimedia.org/P70059 and previous config saved to /var/cache/conftool/dbconfig/20241015-222936-ladsgroup.json [22:38:13] 
!log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [22:39:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T376905)', diff saved to https://phabricator.wikimedia.org/P70060 and previous config saved to /var/cache/conftool/dbconfig/20241015-223902-ladsgroup.json [22:48:32] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [22:52:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sdk) failed on moss-be1002 - https://phabricator.wikimedia.org/T377154#10231891 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty we will replace drive tomorrow while on site I have updated Idrac firmware while i was logged i... [22:54:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P70061 and previous config saved to /var/cache/conftool/dbconfig/20241015-225409-ladsgroup.json [22:54:42] (03PS1) 10Cwhite: ci: capture job completion timer metrics [puppet] - 10https://gerrit.wikimedia.org/r/1080400 (https://phabricator.wikimedia.org/T233089) [23:04:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T371742)', diff saved to https://phabricator.wikimedia.org/P70062 and previous config saved to /var/cache/conftool/dbconfig/20241015-230456-ladsgroup.json [23:05:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:09:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P70063 and previous config saved to /var/cache/conftool/dbconfig/20241015-230916-ladsgroup.json [23:09:31] (03CR) 10Scott French: [C:03+1] "Thank you! Two last clarity comments (one optional, one ... recommended), but otherwise looks good." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [23:20:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P70064 and previous config saved to /var/cache/conftool/dbconfig/20241015-232003-ladsgroup.json [23:24:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T376905)', diff saved to https://phabricator.wikimedia.org/P70065 and previous config saved to /var/cache/conftool/dbconfig/20241015-232423-ladsgroup.json [23:24:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [23:24:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [23:24:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:24:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:24:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T376905)', diff saved to https://phabricator.wikimedia.org/P70066 and previous config saved to /var/cache/conftool/dbconfig/20241015-232456-ladsgroup.json [23:28:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109#10231949 (10Jclark-ctr) a:03Jclark-ctr Opened ticket with dell Confirmed: Service Request 199342082 was successfully submitted. 
[23:30:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T376905)', diff saved to https://phabricator.wikimedia.org/P70067 and previous config saved to /var/cache/conftool/dbconfig/20241015-233043-ladsgroup.json [23:33:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdv) failed on ms-be1065 - https://phabricator.wikimedia.org/T376775#10231956 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr @jcrespo yes we do have spare 8tb drives am i able to change in the morning? also did update idrac firmware whi... [23:35:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P70068 and previous config saved to /var/cache/conftool/dbconfig/20241015-233510-ladsgroup.json [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1080412 [23:38:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1080412 (owner: 10TrainBranchBot) [23:45:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P70069 and previous config saved to /var/cache/conftool/dbconfig/20241015-234550-ladsgroup.json [23:50:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T371742)', diff saved to https://phabricator.wikimedia.org/P70070 and previous config saved to /var/cache/conftool/dbconfig/20241015-235017-ladsgroup.json [23:50:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [23:50:22] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:50:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
12:00:00 on db2164.codfw.wmnet with reason: Maintenance [23:50:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:50:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:50:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T371742)', diff saved to https://phabricator.wikimedia.org/P70071 and previous config saved to /var/cache/conftool/dbconfig/20241015-235055-ladsgroup.json