[00:02:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [00:06:43] FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [00:08:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195787 [00:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195787 (owner: 10TrainBranchBot) [00:28:14] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195787 (owner: 10TrainBranchBot) [00:42:22] (03PS4) 10Jon Harald Søby: Remove artifact from Quechua Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195789 [00:42:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195789 (owner: 10Jon Harald Søby) [00:43:55] (03PS5) 10Jon Harald Søby: Remove artifact from Quechua Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195789 [00:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:47] !log 
mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:07:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.23 [core] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1195791 (https://phabricator.wikimedia.org/T405679) [01:07:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.23 [core] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1195791 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [01:14:08] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 20s) [01:21:55] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.23 [core] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1195791 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [01:32:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:39:54] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet [01:45:57] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet [01:47:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:52:14] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet [01:53:36] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:58:32] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0200) [02:05:54] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [02:09:43] FIRING: SLOMetricAbsent: charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:09:54] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [02:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:20:16] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [02:24:16] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [02:32:33] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:44:43] RESOLVED: SLOMetricAbsent: charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0300) [03:02:09] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195795 (https://phabricator.wikimedia.org/T405679) [03:02:12] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195795 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [03:03:02] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195795 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [03:03:32] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.23 refs T405679 [03:03:35] T405679: 1.45.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T405679 [03:38:27] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:48:34] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.23 refs T405679 (duration: 45m 02s) [03:48:38] T405679: 1.45.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T405679 [04:00:04]
Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0400) [04:02:46] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.20 (duration: 02m 42s) [04:06:43] FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [04:41:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1033 gradually with 4 steps - Pool es1033.eqiad.wmnet in after cloning [04:46:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:46:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:48:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:50:45] (03PS1) 10Marostegui: db1221: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195800 (https://phabricator.wikimedia.org/T406541) [04:52:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 14 hosts with reason: Upgrading [04:52:23] (03CR) 10Marostegui: [C:03+2] db1221: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195800 (https://phabricator.wikimedia.org/T406541) (owner: 
10Marostegui) [04:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:53:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1221.eqiad.wmnet with reason: Maintenance [04:53:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1221 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83824 and previous config saved to /var/cache/conftool/dbconfig/20251014-045305-marostegui.json [05:00:40] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1195805 (https://phabricator.wikimedia.org/T407176) [05:01:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83826 and previous config saved to /var/cache/conftool/dbconfig/20251014-050113-root.json [05:04:16] (03CR) 10Marostegui: [C:03+1] db1176.yaml: major MariaDB version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1195706 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [05:05:35] (03CR) 10Marostegui: [C:03+1] site.pp: Remove es2052 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [05:07:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11270135 (10Marostegui) ` es1031 C3 es1032 C3 es1033 D8 es1034 D8 ` Can be ignored, will be decommissioned in a couple of weeks (T406690) [05:08:27] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1031-1032].eqiad.wmnet with reason: Cloning [05:16:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83828 and previous config saved to /var/cache/conftool/dbconfig/20251014-051619-root.json [05:16:30] (03PS1) 10Marostegui: mariadb: Productionize es1054 [puppet] - 10https://gerrit.wikimedia.org/r/1195816 (https://phabricator.wikimedia.org/T406488) [05:17:10] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1054 [puppet] - 10https://gerrit.wikimedia.org/r/1195816 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:20:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1031.eqiad.wmnet onto es1054.eqiad.wmnet [05:20:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1031 - Depool es1031.eqiad.wmnet to then clone it to es1054.eqiad.wmnet - marostegui@cumin1003 [05:24:12] (03PS1) 10Marostegui: es1050: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195822 (https://phabricator.wikimedia.org/T406488) [05:24:55] (03CR) 10Marostegui: [C:03+2] es1050: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195822 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:25:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83830 and previous config saved to /var/cache/conftool/dbconfig/20251014-052508-root.json [05:25:14] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:26:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1033 gradually with 4 steps - Pool es1033.eqiad.wmnet in after 
cloning [05:26:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1033.eqiad.wmnet onto es1056.eqiad.wmnet [05:27:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1031 - Depool es1031.eqiad.wmnet to then clone it to es1054.eqiad.wmnet - marostegui@cumin1003 [05:28:22] (03PS1) 10Marostegui: es1053: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195823 (https://phabricator.wikimedia.org/T406488) [05:28:59] (03CR) 10Marostegui: [C:03+2] es1053: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195823 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:29:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83832 and previous config saved to /var/cache/conftool/dbconfig/20251014-052926-root.json [05:31:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83833 and previous config saved to /var/cache/conftool/dbconfig/20251014-053125-root.json [05:33:27] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:36:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1160 with weight 0 T407176', 
diff saved to https://phabricator.wikimedia.org/P83834 and previous config saved to /var/cache/conftool/dbconfig/20251014-053654-marostegui.json [05:36:59] T407176: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T407176 [05:37:18] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T407176 [05:37:40] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1195805 (https://phabricator.wikimedia.org/T407176) (owner: 10Gerrit maintenance bot) [05:38:27] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:39:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:39:36] (03PS1) 10Marostegui: db1244: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195826 (https://phabricator.wikimedia.org/T406541) [05:40:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83835 and previous config saved to /var/cache/conftool/dbconfig/20251014-054014-root.json [05:40:19] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:41:02] !log Starting s4 eqiad failover from db1244 to db1160 - 
T407176 [05:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:41:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1160 to s4 primary T407176', diff saved to https://phabricator.wikimedia.org/P83836 and previous config saved to /var/cache/conftool/dbconfig/20251014-054118-marostegui.json [05:41:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:42:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1244 T407176', diff saved to https://phabricator.wikimedia.org/P83837 and previous config saved to /var/cache/conftool/dbconfig/20251014-054200-marostegui.json [05:42:05] T407176: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T407176 [05:42:54] (03CR) 10Marostegui: [C:03+2] db1244: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195826 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [05:43:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:43:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:43:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:43:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1244.eqiad.wmnet with reason: Maintenance [05:44:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83838 and previous config saved to /var/cache/conftool/dbconfig/20251014-054432-root.json [05:46:07] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1195827 (https://phabricator.wikimedia.org/T407177) [05:46:12] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1195828 (https://phabricator.wikimedia.org/T407177) [05:46:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83839 and previous config saved to /var/cache/conftool/dbconfig/20251014-054631-root.json [05:47:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:07] (03PS1) 10Marostegui: mariadb: Productionize es1055 [puppet] - 10https://gerrit.wikimedia.org/r/1195829 (https://phabricator.wikimedia.org/T406488) [05:49:42] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1055 [puppet] - 10https://gerrit.wikimedia.org/r/1195829 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:52:07] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'db1244 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83840 and previous config saved to /var/cache/conftool/dbconfig/20251014-055206-root.json [05:53:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1032.eqiad.wmnet onto es1055.eqiad.wmnet [05:53:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1032 - Depool es1032.eqiad.wmnet to then clone it to es1055.eqiad.wmnet - marostegui@cumin1003 [05:55:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83842 and previous config saved to /var/cache/conftool/dbconfig/20251014-055520-root.json [05:55:25] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:58:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:58:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:58:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:59:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83843 and previous config saved to /var/cache/conftool/dbconfig/20251014-055938-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0600). [06:04:33] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:05:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195409 (https://phabricator.wikimedia.org/T402366) (owner: 10Kosta Harlan) [06:07:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83844 and previous config saved to /var/cache/conftool/dbconfig/20251014-060712-root.json [06:09:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:10:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83845 and previous config saved to /var/cache/conftool/dbconfig/20251014-061026-root.json [06:10:31] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:11:30] (03PS1) 10Phuedx: ext-EventLogging: Allowlist product_metrics.web_base_with_ip stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195830 (https://phabricator.wikimedia.org/T406332) [06:14:19] !log 
filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [06:14:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1032 - Depool es1032.eqiad.wmnet to then clone it to es1055.eqiad.wmnet - marostegui@cumin1003 [06:14:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83846 and previous config saved to /var/cache/conftool/dbconfig/20251014-061444-root.json [06:16:17] FIRING: ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:21:09] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [06:21:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:22:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83847 and previous config saved to /var/cache/conftool/dbconfig/20251014-062218-root.json [06:25:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 20%: Host provisioned T406488', diff saved to 
https://phabricator.wikimedia.org/P83848 and previous config saved to /var/cache/conftool/dbconfig/20251014-062532-root.json [06:25:37] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:29:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83850 and previous config saved to /var/cache/conftool/dbconfig/20251014-062949-root.json [06:31:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:32:33] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:37:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83851 and previous config saved to /var/cache/conftool/dbconfig/20251014-063724-root.json [06:40:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83852 and previous config saved to /var/cache/conftool/dbconfig/20251014-064038-root.json [06:40:42] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:44:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83853 and previous config saved to /var/cache/conftool/dbconfig/20251014-064455-root.json [06:49:49] (03CR) 10Elukey: "The last step before merging is to verify with olly if the expressions added for HTTP 
metrics may lead to an excessive label cardinality a" [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [06:51:26] (03PS1) 10Superpes15: [eswiktionary] Create a Tesauro namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195834 (https://phabricator.wikimedia.org/T407150) [06:51:36] (03CR) 10Elukey: [C:03+2] admin_ng: deploy the cluster role for the GPU node labeller to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195709 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [06:51:51] (03PS2) 10Superpes15: [enwikibooks] Set $wgAutoConfirmAge to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 (https://phabricator.wikimedia.org/T407080) [06:52:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:53:09] (03PS4) 10DCausse: cirrus: test completion with default sort on simplewiki [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) [06:55:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83854 and previous config saved to /var/cache/conftool/dbconfig/20251014-065544-root.json [06:55:49] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:59:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:00:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 30%: Host provisioned T406488', diff saved to 
https://phabricator.wikimedia.org/P83855 and previous config saved to /var/cache/conftool/dbconfig/20251014-070001-root.json [07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0700) [07:00:05] joelyrookewmde, Jhs, dcausse, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] present [07:00:35] o/ [07:01:08]  hi! [07:02:10] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:06:55] No one around for deployment then? [07:07:06] o/ [07:07:17] sorry I'm a bit late, I can deploy [07:08:02] oh nice, thanks! [07:09:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195830 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [07:10:01] ^ If that'll fit in the window, that'd be great [07:10:25] phuedx: sure [07:10:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83856 and previous config saved to /var/cache/conftool/dbconfig/20251014-071050-root.json [07:10:54] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:11:58] Jhs: o/ I was checking the logo locally, I think I'll start with your patch while joelyrookewmde's one goes through CI [07:12:59] Jhs: are you still around? 
[07:13:36] dcausse, i'm around, yeah [07:13:37] works for me! Although I'm curious which CI my patch has to go through? [07:14:12] joelyrookewmde: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1195755 is for Wikibase and CI is generally a lot slower than mw-config [07:14:20] (03PS1) 10Superpes15: [kawiki] Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) [07:14:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195789 (owner: 10Jon Harald Søby) [07:15:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83857 and previous config saved to /var/cache/conftool/dbconfig/20251014-071507-root.json [07:15:14] (03CR) 10DCausse: [C:03+2] Implement new usage types for statement with qualifiers and references [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) (owner: 10Joely Rooke WMDE) [07:15:53] (03Merged) 10jenkins-bot: Remove artifact from Quechua Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195789 (owner: 10Jon Harald Søby) [07:16:43] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1195789|Remove artifact from Quechua Wikipedia wordmark]] [07:18:14] ahhhh the post approval one. nice, thanks : ) [07:18:16] (03PS2) 10Superpes15: [kawiki] Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) [07:19:20] dcausse Please note that, at your preferences, my 3 patches can be merged together :) [07:19:37] Superpes: sure! 
[07:21:17] !log dcausse@deploy2002 jhsoby, dcausse: Backport for [[gerrit:1195789|Remove artifact from Quechua Wikipedia wordmark]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:21:30] Jhs: should be ready for testing [07:21:34] (03PS3) 10Superpes15: [enwikibooks] Set $wgAutoConfirmCount to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 (https://phabricator.wikimedia.org/T407080) [07:21:49] dcausse, aye, just did – looks good 👍 [07:22:00] ok, shipping [07:22:08] thx [07:22:13] !log dcausse@deploy2002 jhsoby, dcausse: Continuing with sync [07:23:49] (03PS1) 10Majavah: team-sre: keyholder: Send alert to role owner [alerts] - 10https://gerrit.wikimedia.org/r/1195945 [07:24:08] (03PS3) 10Superpes15: [kawiki] Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) [07:24:39] (03CR) 10Filippo Giunchedi: [C:03+1] team-sre: keyholder: Send alert to role owner [alerts] - 10https://gerrit.wikimedia.org/r/1195945 (owner: 10Majavah) [07:25:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83858 and previous config saved to /var/cache/conftool/dbconfig/20251014-072556-root.json [07:26:00] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:26:11] dcausse another quick q since I've not done so much cherry picking before: 1195755 is part of the change 1185060 which i realised has to go out before the train so that there aren't backwards compatibility issues during train deployment. but both commits have the same Change-Id. Is that problematic? 
[07:26:29] (03PS2) 10Majavah: team-sre: keyholder: Send alert to role owner [alerts] - 10https://gerrit.wikimedia.org/r/1195945 [07:26:56] joelyrookewmde: looking [07:27:11] joelyrookewmde: not as they are on different branches [07:27:51] ok, good to know! thanks [07:28:02] joelyrookewmde: if your backport does not depend on a particular change in mw-core or another extension it should be good [07:28:29] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195789|Remove artifact from Quechua Wikipedia wordmark]] (duration: 11m 46s) [07:28:41] clearing the cache for the logo [07:29:32] nice! [07:29:40] Jhs: should be live, the logo looks good to me [07:29:48] (03Merged) 10jenkins-bot: Implement new usage types for statement with qualifiers and references [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) (owner: 10Joely Rooke WMDE) [07:30:07] just in time [07:30:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83859 and previous config saved to /var/cache/conftool/dbconfig/20251014-073013-root.json [07:30:40] hahah perfect [07:32:15] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1195755|Implement new usage types for statement with qualifiers and references (T401290)]] [07:32:19] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [07:33:26] (03CR) 10Majavah: [C:03+2] team-sre: keyholder: Send alert to role owner [alerts] - 10https://gerrit.wikimedia.org/r/1195945 (owner: 10Majavah) [07:33:48] (03CR) 10DCausse: [C:03+1] [enwikibooks] Set $wgAutoConfirmCount to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 (https://phabricator.wikimedia.org/T407080) (owner: 10Superpes15) [07:34:36] (03Merged) 10jenkins-bot: team-sre: keyholder: Send alert to role owner [alerts] - 
10https://gerrit.wikimedia.org/r/1195945 (owner: 10Majavah) [07:34:43] (03CR) 10DCausse: [C:03+1] [eswiktionary] Create a Tesauro namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195834 (https://phabricator.wikimedia.org/T407150) (owner: 10Superpes15) [07:36:36] !log dcausse@deploy2002 joelyrookewmde, dcausse: Backport for [[gerrit:1195755|Implement new usage types for statement with qualifiers and references (T401290)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:36:59] ^nothing to test [07:37:06] ok [07:39:05] !log dcausse@deploy2002 joelyrookewmde, dcausse: Continuing with sync [07:39:47] dcausse: hi! I have a security patch for T405859 that I'd like to deploy in a bit... Can you help me out with that? I haven't done a security deployment in years [07:41:00] duesen: hey! I'm afraid I'm not very knowledgeable on that front :( [07:41:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83860 and previous config saved to /var/cache/conftool/dbconfig/20251014-074102-root.json [07:41:07] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:41:56] dcausse: the instructions are clear enough, I think I can manage, but I would feel better to have someone around in case something goes wrong... [07:42:19] sure [07:42:37] jouncebot: next [07:42:38] In 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0800) [07:42:52] excellent! I need about 10 minutes or so... let me know when you are done with regular deployments. 
[07:43:04] sounds good [07:43:05] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195755|Implement new usage types for statement with qualifiers and references (T401290)]] (duration: 10m 50s) [07:43:09] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [07:43:23] (03CR) 10DCausse: [C:03+1] [kawiki] Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) (owner: 10Superpes15) [07:43:23] Amir1: are you around? You wrote the script for security patch deployment, right? Is it still the best way to deploy? [07:43:38] joelyrookewmde: should be live [07:43:59] Superpes: going to ship your patches [07:44:05] ty ty [07:44:13] dcausse Thanks :) [07:45:04] Hi, I am around and I will run the train after the backport window has completed :) [07:45:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83861 and previous config saved to /var/cache/conftool/dbconfig/20251014-074519-root.json [07:46:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 (https://phabricator.wikimedia.org/T407080) (owner: 10Superpes15) [07:46:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195834 (https://phabricator.wikimedia.org/T407150) (owner: 10Superpes15) [07:46:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) (owner: 10Superpes15) [07:46:50] (03Merged) 10jenkins-bot: [enwikibooks] Set $wgAutoConfirmCount to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 
(https://phabricator.wikimedia.org/T407080) (owner: 10Superpes15) [07:46:59] (03Merged) 10jenkins-bot: [eswiktionary] Create a Tesauro namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195834 (https://phabricator.wikimedia.org/T407150) (owner: 10Superpes15) [07:47:01] (03Merged) 10jenkins-bot: [kawiki] Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195909 (https://phabricator.wikimedia.org/T407076) (owner: 10Superpes15) [07:47:34] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1195770|[enwikibooks] Set $wgAutoConfirmCount to 5 (T407080)]], [[gerrit:1195834|[eswiktionary] Create a Tesauro namespace (T407150)]], [[gerrit:1195909|[kawiki] Enable NewUserMessage extension (T407076)]] [07:47:41] T407080: Modify the autoconfirmed user group to add an edit count on English Wikibooks - https://phabricator.wikimedia.org/T407080 [07:47:42] T407150: Add "Tesauro:" and "Tesauro discusión:" namespaces to eswiktionary - https://phabricator.wikimedia.org/T407150 [07:47:42] T407076: Enable Extension NewUserMessage on ka.wikipedia - https://phabricator.wikimedia.org/T407076 [07:48:02] (03CR) 10Brouberol: [C:03+1] flink-operator: align mem settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194735 (https://phabricator.wikimedia.org/T405361) (owner: 10DCausse) [07:51:29] hashar: hey, can I squeeze in a security patch before you run the train? [07:51:53] !log dcausse@deploy2002 dcausse, superpes: Backport for [[gerrit:1195770|[enwikibooks] Set $wgAutoConfirmCount to 5 (T407080)]], [[gerrit:1195834|[eswiktionary] Create a Tesauro namespace (T407150)]], [[gerrit:1195909|[kawiki] Enable NewUserMessage extension (T407076)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:52:28] Testing [07:52:31] thanks [07:53:20] hashar: we'll be late with the backport window :( [07:53:27] dcausse They look fine :) [07:53:37] Superpes: nice, shipping [07:54:44] !log dcausse@deploy2002 dcausse, superpes: Continuing with sync [07:55:42] phuedx: do you mind if I ship your patch alongside with mine? [07:56:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1050 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83862 and previous config saved to /var/cache/conftool/dbconfig/20251014-075608-root.json [07:56:12] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:57:20] dcausse: Go for it. Mine should be low risk :) [07:57:43] thanks! [07:59:03] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195770|[enwikibooks] Set $wgAutoConfirmCount to 5 (T407080)]], [[gerrit:1195834|[eswiktionary] Create a Tesauro namespace (T407150)]], [[gerrit:1195909|[kawiki] Enable NewUserMessage extension (T407076)]] (duration: 11m 29s) [07:59:11] T407080: Modify the autoconfirmed user group to add an edit count on English Wikibooks - https://phabricator.wikimedia.org/T407080 [07:59:11] T407150: Add "Tesauro:" and "Tesauro discusión:" namespaces to eswiktionary - https://phabricator.wikimedia.org/T407150 [07:59:12] T407076: Enable Extension NewUserMessage on ka.wikipedia - https://phabricator.wikimedia.org/T407076 [07:59:27] running namespaceDup [07:59:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0800) [08:00:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:00:16] dcausse: no worries :) [08:00:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1053 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83863 and previous config saved to /var/cache/conftool/dbconfig/20251014-080025-root.json [08:00:27] Superpes: should be live, I'll take care of namespaceDupe while other patches are being shipped [08:01:06] duesen: can we do the security patch this afternoon? after the train I gotta switch over Gerrit with SRE [08:01:15] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:01:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:01:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195830 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [08:01:40] hashar: hm... when in the afternoon? I have a bunch of meetings coming up [08:01:40] dcausse Wonderful! Thanks for your assistance :3 [08:01:50] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:02:10] duesen: well lets go for it after the backport has completed so [08:02:24] Superpes: yw! 
:) [08:02:29] !log dcausse@deploy2002 mwscript-k8s job started: namespaceDupes eswiktionary --fix # T407150 [08:02:54] (03Merged) 10jenkins-bot: cirrus: test completion with default sort on simplewiki [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:03:00] (03Merged) 10jenkins-bot: ext-EventLogging: Allowlist product_metrics.web_base_with_ip stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195830 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [08:03:32] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1193093|cirrus: test completion with default sort on simplewiki [3/3] (T404858)]], [[gerrit:1195830|ext-EventLogging: Allowlist product_metrics.web_base_with_ip stream (T406332)]] [08:03:38] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:03:38] T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332 [08:03:54] hashar: I should deploy for .22 and also .23, right? [08:05:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11270484 (10Marostegui) ` es1045 C5 es1046 D6 es1051 D1 es1052 D3 es1053 D6 es1057 C3 ` Can be done anytime, just need downtime [08:05:45] duesen: correct [08:06:03] where do I test .23? 
[08:06:06] you can see the versions https://tools.wmflabs.org/versions/ [08:06:19] .23 is on test.wikipedia.org already [08:06:32] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Remove es2052 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:06:37] (it does not show up on the tool, we would need to add a "Group test") [08:06:41] before Group 0 [08:06:53] yea, that would be useful [08:07:05] (03CR) 10Federico Ceratto: [C:03+2] db1176.yaml: major MariaDB version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1195706 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [08:07:50] !log dcausse@deploy2002 dcausse, phuedx: Backport for [[gerrit:1193093|cirrus: test completion with default sort on simplewiki [3/3] (T404858)]], [[gerrit:1195830|ext-EventLogging: Allowlist product_metrics.web_base_with_ip stream (T406332)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:08:02] phuedx: please let me know if you need to test anything [08:08:10] dcausse: Yes. Looking [08:09:58] dcausse: LGTM [08:10:06] ok, shipping [08:10:10] !log dcausse@deploy2002 dcausse, phuedx: Continuing with sync [08:11:38] hashar: I could do the deployment during the regular afternoon deployment window around 15:00 CEST if you prefer. This window is already quite long, and I have a meeting coming up at 11. 
[08:12:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [08:12:32] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [08:12:45] ...let me know what you prefer [08:12:57] (03PS1) 10Marostegui: mariadb: Productionize db1260 [puppet] - 10https://gerrit.wikimedia.org/r/1195977 (https://phabricator.wikimedia.org/T406550) [08:14:18] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193093|cirrus: test completion with default sort on simplewiki [3/3] (T404858)]], [[gerrit:1195830|ext-EventLogging: Allowlist product_metrics.web_base_with_ip stream (T406332)]] (duration: 10m 46s) [08:14:25] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:14:25] T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332 [08:15:54] phuedx: should be live [08:15:54] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1260 [puppet] - 10https://gerrit.wikimedia.org/r/1195977 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [08:16:04] dcausse: Thanks. I'll monitor EventGate [08:16:05] federico3: ok to merge your change? [08:16:19] duesen: I'm done, do you still me? 
[08:16:25] *need [08:16:44] (03PS1) 10Brouberol: growthbook: remove sextant dependency files from the root chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195979 (https://phabricator.wikimedia.org/T406786) [08:16:57] (03CR) 10Federico Ceratto: [C:03+2] sanitize-wiki.py: Improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1191689 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto) [08:17:12] federico3: ping [08:17:25] duesen: lets do it this afternoon [08:17:33] I'll run the train [08:17:35] looking [08:17:37] ok [08:17:41] hashar I'm done with the deploys [08:18:18] !log closing the UTC morning backport window [08:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:21] duesen: I have added to myself to be there at 15:00 to assist you [08:18:24] dcausse: merci! [08:18:32] (03CR) 10Superpes15: "Please note that per https://w.wiki/FgE2 "the wordmark width should not exceed 124px"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195183 (owner: 10GergesShamon) [08:18:48] @marostegui I see a merged CR [08:19:03] federico3: You didn't run puppet-merge, can I do it and merge your change? [08:19:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [08:20:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [08:20:44] looks like something else is going on [08:21:01] federico3: What? Can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195706? [08:21:09] You merged it but didn't run puppet-merge, can I do it? 
[08:21:17] I see something locking the merging on puppetserver1001 [08:21:31] federico3: it is me waiting for your answer :) [08:21:35] (03CR) 10Brouberol: [C:03+1] opensearch on k8s: add service definitions [puppet] - 10https://gerrit.wikimedia.org/r/1195342 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [08:21:35] it could be hanging because gerrit had a short downtime [08:21:36] ah yes [08:21:39] ok [08:21:40] thanks [08:21:42] go ahead [08:21:45] doing it [08:23:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [08:23:55] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [08:25:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1247.eqiad.wmnet onto db1260.eqiad.wmnet [08:25:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1247 - Depool db1247.eqiad.wmnet to then clone it to db1260.eqiad.wmnet - marostegui@cumin1003 [08:26:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1247 - Depool db1247.eqiad.wmnet to then clone it to db1260.eqiad.wmnet - marostegui@cumin1003 [08:27:02] I am running the train now [08:27:10] (03CR) 10Brouberol: [C:03+2] flink-operator: align mem settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194735 (https://phabricator.wikimedia.org/T405361) (owner: 10DCausse) [08:27:42] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195980 (https://phabricator.wikimedia.org/T405679) [08:27:45] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195980 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:28:42] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195980 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:29:12] !log 
brouberol@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:30:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [08:30:53] !log brouberol@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:31:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [08:31:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [08:31:51] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet [08:32:52] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [08:33:31] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [08:33:46] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:34:16] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:34:46] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admin" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187 (10JMonton-WMF) 03NEW [08:35:30] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11270688 (10JMonton-WMF) [08:37:07] !log brouberol@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[08:37:09] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1195713 (owner: 10Muehlenhoff) [08:37:23] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.23 refs T405679 [08:37:27] T405679: 1.45.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T405679 [08:38:08] !log brouberol@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:39:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet [08:39:39] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Add approval config for airflow-wikidata-ops [puppet] - 10https://gerrit.wikimedia.org/r/1195202 (https://phabricator.wikimedia.org/T405557) (owner: 10Muehlenhoff) [08:40:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:40:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [08:41:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:41:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet [08:42:23] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:44:00] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:44:38] (03CR) 10Brouberol: [C:03+1] Add the opensearch namespaces to the list of tenents for rbd in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195771 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [08:44:50] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[08:45:05] (03CR) 10Btullis: [C:03+2] Add the opensearch namespaces to the list of tenents for rbd in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195771 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [08:45:56] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:45:58] (03Abandoned) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [08:47:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [08:48:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11270723 (10cmooney) >>! In T407008#11265835, @RobH wrote: > @cmooney: Shouldn't the circuit be 'decommissioning' status in netbox at th... [08:49:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [08:50:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet [08:50:03] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet [08:50:08] (03CR) 10Btullis: [C:03+1] growthbook: remove sextant dependency files from the root chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195979 (https://phabricator.wikimedia.org/T406786) (owner: 10Brouberol) [08:52:07] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: apply the node labeller to all k8s nodes with a GPU [puppet] - 10https://gerrit.wikimedia.org/r/1195708 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [08:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:52:39] (03Merged) 10jenkins-bot: Add the opensearch namespaces to the list of tenents for rbd in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195771 (https://phabricator.wikimedia.org/T397246) (owner: 10Btullis) [08:52:41] arnaudb: MetricsPlatform has some error https://phabricator.wikimedia.org/T407188 [08:52:51] I am checking with Sam whether that is terrible or not :) [08:52:57] ack, standing by [08:55:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [08:55:51] arnaudb: I have clarified with him, that is not a train blocker [08:56:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [08:56:19] ack thanks! [08:56:54] !log enable new inter.link IP transit circuit on cr1-drms T401104 [08:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet [08:57:48] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet [08:58:05] (03PS1) 10Brouberol: airflow-main: increase max_map_length from 1024 to 1200 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) [08:58:37] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0800) [09:00:05] arnaudb and hashar: Deploy window 
Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T0900) [09:00:23] (03CR) 10Arnaudb: [C:03+2] Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1194932 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:00:28] (03CR) 10Arnaudb: [C:03+2] Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1194931 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:00:36] !log arnaudb@dns1004 START - running authdns-update [09:01:25] FIRING: SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:02] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [09:02:07] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [09:02:17] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [09:02:50] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org [09:02:56] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org [09:04:07] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org [09:04:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [09:04:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [09:04:20] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from 
gerrit2003.wikimedia.org [09:05:14] why are there gerrit switch overs so often nowadays? I think this is the third in the span of a week or two [09:05:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet [09:05:28] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet [09:05:32] the first 2 sadly failed Jhs sorry about this [09:05:45] ah, i see [09:06:16] no worries. good to get things to work like they should, so keep at it :) [09:06:40] thanks for your understanding :) [09:09:22] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:23] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:10:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet [09:10:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet [09:11:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on 
dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:25] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:40] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:11:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [09:11:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [09:11:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch 
the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:12:10] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:12:58] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test2002.codfw.wmnet [09:12:59] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [09:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:14:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:14:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:15:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:16:25] RESOLVED: [3x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:17:17] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2002.codfw.wmnet - fceratto@cumin1002" [09:17:54] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2002.codfw.wmnet - fceratto@cumin1002" [09:17:54] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:17:54] !log fceratto@cumin1002 START - Cookbook sre.dns.wipe-cache db-test2002.codfw.wmnet on all recursors [09:17:58] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test2002.codfw.wmnet on all recursors [09:18:28] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2002.codfw.wmnet - fceratto@cumin1002" [09:18:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2002.codfw.wmnet - fceratto@cumin1002" [09:18:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [09:18:59] !log mvernon@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ms-be2073.codfw.wmnet [09:19:22] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:25] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db-test2002.codfw.wmnet with OS trixie [09:22:07] !log arnaudb@dns1004 END - running authdns-update [09:22:15] !log arnaudb@cumin1003 START - Cookbook sre.dns.wipe-cache gerrit.wikimedia.org gerrit-replica.wikimedia.org on all recursors [09:22:19] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gerrit.wikimedia.org gerrit-replica.wikimedia.org on all recursors [09:22:55] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:16] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:24:22] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:22] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:42] RECOVERY - check if authdns-update was run 
after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:25:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:25:52] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.failover (exit_code=99) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [09:26:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [09:26:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet [09:26:36] RECOVERY - check if 
authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:38] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:26:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:27:10] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:27:55] FIRING: [5x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:28:22] RECOVERY - check if 
authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:28:24] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:28:27] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:29:22] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:20] PROBLEM - Host ms-be1081 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:08] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test2002.codfw.wmnet with reason: host reimage [09:34:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet [09:34:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet [09:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:37:06] !log fceratto@cumin1002 END (PASS) - Cookbook 
sre.mysql.major-upgrade (exit_code=0) [09:38:27] RESOLVED: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:38:49] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test2002.codfw.wmnet with reason: host reimage [09:41:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet [09:41:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet [09:44:22] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:10] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:27] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:50:28] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the 
SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:50:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet [09:50:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet [09:51:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:40] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:51:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:52:09] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test2002.codfw.wmnet with OS trixie [09:52:10] !log fceratto@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test2002.codfw.wmnet [09:52:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:53:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:53:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:53:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:54:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [09:58:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1000) [10:01:07] PROBLEM - gdnsd checkconf #page on dns1004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [10:01:22] fabfur: ^^ expected? [10:01:42] hey:) [10:02:07] RECOVERY - gdnsd checkconf #page on dns1004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [10:04:19] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11271042 (10Mvolz) >>! In T391852#11269205, @elukey wrote: > @Mvolz I added two new panels to https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?... [10:04:21] !log mwscript-k8s --follow --dblist=group0 -- purgeUserOptions.php (T406724) [10:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [10:09:56] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test1002.eqiad.wmnet [10:09:58] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [10:12:09] I think we are not supposed to run dns or netbox cookbook [10:12:11] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11271062 (10elukey) >>! In T391852#11271042, @Mvolz wrote: >>>! In T391852#11269205, @elukey wrote: >> @Mvolz I added two new panels to https://gr... 
[10:15:19] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1002.eqiad.wmnet - fceratto@cumin1002"
[10:15:58] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1081 reports no disks - controller failure? - https://phabricator.wikimedia.org/T407198 (MatthewVernon) NEW
[10:16:06] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1081 reports no disks - controller failure? - https://phabricator.wikimedia.org/T407198#11271087 (MatthewVernon) p:Triage→High
[10:16:34] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be1081.eqiad.wmnet
[10:16:39] Amir1: I agree, not sure what is the gerrit status though. arnaudb could you please update the chan?
[10:17:24] federico3: o/ if possible wait a bit before running cookbook that may change DNS/etc..
[10:17:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1002.eqiad.wmnet - fceratto@cumin1002"
[10:17:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:17:38] !log fceratto@cumin1002 START - Cookbook sre.dns.wipe-cache db-test1002.eqiad.wmnet on all recursors
[10:17:42] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1002.eqiad.wmnet on all recursors
[10:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:17:47] see the private channels
[10:18:04] I'm in the middle of the vm creation cookbook, want me to abort it?
[10:18:09] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1002.eqiad.wmnet - fceratto@cumin1002" [10:18:15] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1002.eqiad.wmnet - fceratto@cumin1002" [10:20:01] federico3: I think you can leave it running, but check the private channel since they are working on the DNS infra atm [10:20:01] !log disabling puppet on all DNS hosts for manual gerrit switch (T407200) [10:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:08] yeah see --^ [10:21:16] fceratto@cumin1002 makevm (PID 3611363) is awaiting input [10:21:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet [10:22:10] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:11] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet [10:27:10] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db-test1002.eqiad.wmnet with OS trixie [10:27:10] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet
[10:28:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet
[10:29:22] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:31:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet
[10:31:21] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1083.eqiad.wmnet
[10:32:30] SRE, Infrastructure-Foundations: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11271186 (Ladsgroup) Adding the team responsible for access management and the SME in this topic.
[10:32:33] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:34:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2053.codfw.wmnet with reason: Setting up new ES host [10:36:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:36:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:36:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet [10:36:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2081.codfw.wmnet [10:36:38] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:36:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:37:32] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1002.eqiad.wmnet with reason: host reimage [10:37:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are 
in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:38:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1083.eqiad.wmnet [10:38:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:38:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:38:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:38:23] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1084.eqiad.wmnet [10:38:27] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:16] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:39:22] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:39:43] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11271224 (10Mvolz) >>! In T391852#11271062, @elukey wrote: >>>! In T391852#11271042, @Mvolz wrote: >>>>! In T391852#11269205, @elukey wrote: >>> @... [10:40:28] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:41:02] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11271229 (10Ladsgroup) a:03thcipriani Assigning to reflect the reality [10:41:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:41:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:41:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:41:36] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:41:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [10:42:10] 
FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:29] !log arnaudb@dns1004 START - running authdns-update [10:43:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2081.codfw.wmnet [10:43:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2082.codfw.wmnet [10:43:21] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1002.eqiad.wmnet with reason: host reimage [10:43:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:43:49] !log arnaudb@dns1004 END - running authdns-update [10:44:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1084.eqiad.wmnet [10:44:04] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1085.eqiad.wmnet [10:46:26] (03PS1) 10JavierMonton: topic: Adding javiermonton to `analytics-admins` and `deployment` groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) [10:48:11] !log enable puppet on all DNS hosts for manual gerrit switch (T407200) [10:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11271273 (10JMonton-WMF) [10:49:04] !log Restarted Zuul to have it reconnect to Gerrit [10:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:45] (03PS2) 10JavierMonton: topic: Adding javiermonton to `analytics-admins` and `deployment` groups. [puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) [10:50:06] (03CR) 10Federico Ceratto: clone_es.py: clone readonly es* hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [10:50:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2082.codfw.wmnet [10:50:32] (03CR) 10CI reject: [V:04-1] topic: Adding javiermonton to `analytics-admins` and `deployment` groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) (owner: 10JavierMonton) [10:51:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1085.eqiad.wmnet [10:51:11] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:51:14] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1086.eqiad.wmnet [10:52:11] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:53:48] RESOLVED: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:54:23] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:55:11] (03PS3) 10JavierMonton: topic: Adding javiermonton to `analytics-admins` and `deployment` groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) [10:55:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1002.eqiad.wmnet with OS trixie [10:55:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1002.eqiad.wmnet [10:56:52] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2085.codfw.wmnet [10:58:31] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1086.eqiad.wmnet [10:58:34] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1087.eqiad.wmnet [11:03:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2085.codfw.wmnet [11:03:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2086.codfw.wmnet [11:04:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [11:05:31] PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1087.eqiad.wmnet [11:05:55] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet [11:07:01] RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [11:09:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [11:11:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2086.codfw.wmnet 
[11:11:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2087.codfw.wmnet [11:11:45] (03PS3) 10Samtar: errorpage.html.erb: Use flex for page layout [puppet] - 10https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) [11:11:46] (03CR) 10Ladsgroup: [C:03+2] errorpage.html.erb: Use flex for page layout [puppet] - 10https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) (owner: 10Samtar) [11:11:47] (03CR) 10Ladsgroup: [V:03+2 C:03+2] errorpage.html.erb: Use flex for page layout [puppet] - 10https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) (owner: 10Samtar) [11:12:17] PROBLEM - Host cloudgw1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:20] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1088.eqiad.wmnet [11:12:23] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1089.eqiad.wmnet [11:13:45] RECOVERY - Host cloudgw1004 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [11:16:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2087.codfw.wmnet [11:16:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [11:18:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1089.eqiad.wmnet [11:19:00] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1090.eqiad.wmnet [11:22:32] 06SRE, 07CSS, 13Patch-For-Review: Update the errorpage template to use flex - https://phabricator.wikimedia.org/T392692#11271408 (10Ladsgroup) It would be really nice to improve 404 page too. [11:23:00] PROBLEM - Host cloudweb1003 #page is DOWN: PING CRITICAL - Packet loss = 100% [11:23:22] Amir1: dis u? 
[11:23:37] haha nope [11:23:42] bullseye reboots [11:23:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2088.codfw.wmnet [11:23:45] ack [11:23:47] at least not intentionally [11:23:48] RECOVERY - Host cloudweb1003 #page is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:23:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2089.codfw.wmnet [11:25:34] that's me, sorry about that, (and TIL that pages via Icinga :/) [11:25:48] np, thanks taavi [11:26:03] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1090.eqiad.wmnet [11:26:06] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1091.eqiad.wmnet [11:26:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11271431 (10Ladsgroup) [11:26:18] PROBLEM - Host cloudweb1004 #page is DOWN: PING CRITICAL - Packet loss = 100% [11:26:37] I sense a pattern lol [11:27:05] ACKNOWLEDGEMENT - Host cloudweb1004 #page is DOWN: PING CRITICAL - Packet loss = 100% Majavah reboot [11:27:23] thanks taavi [11:27:46] RECOVERY - Host cloudweb1004 #page is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:28:04] sre.hosts.reboot-single should avoid that by setting downtime and removing it after a complete puppet run of the rebooted host [11:28:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:28:45] the main problem is that the cookbook I was using for those is running on the cloudcumins, which can't access icinga due to $REASONS [11:28:55] so those were silenced on AM, but not on Icinga [11:28:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:29:33] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - 
https://phabricator.wikimedia.org/T407094#11271445 (10Ladsgroup) Hi, I'm on clinic duty this week. Regarding your ssh key, do you prefer not to use RSA? See https://security.stackexchange.com/questions/90077/ssh-key... [11:30:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:30:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:30:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2089.codfw.wmnet [11:31:22] (03CR) 10Marostegui: "It looks good to me, I've made some comments mostly related to the future of this cookbook." [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [11:31:40] 06SRE, 07CSS: Update the errorpage template to use flex - https://phabricator.wikimedia.org/T392692#11271464 (10Ladsgroup) 05Open→03Resolved a:03TheresNoTime [11:32:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1091.eqiad.wmnet [11:32:59] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1092.eqiad.wmnet [11:35:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11271479 (10Marostegui) [11:37:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11271483 (10Marostegui) [11:37:51] jouncebot: nowandnext [11:37:51] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [11:37:51] In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1200) [11:37:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:38:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-ml: apply [11:38:32] (03PS1) 10Ladsgroup: filebackend: Remove consistency check for multi-backend [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196018 (https://phabricator.wikimedia.org/T328872) [11:38:38] (03CR) 10Ladsgroup: [C:03+2] filebackend: Remove consistency check for multi-backend [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196018 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [11:39:03] (03PS1) 10Majavah: hieradata: Make cloudweb Icinga checks non-critical [puppet] - 10https://gerrit.wikimedia.org/r/1196019 (https://phabricator.wikimedia.org/T407208) [11:39:48] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1092.eqiad.wmnet [11:39:51] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1093.eqiad.wmnet [11:40:24] (03PS1) 10Federico Ceratto: site.pp: Remove es2053 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196020 (https://phabricator.wikimedia.org/T402859) [11:40:26] (03PS1) 10Federico Ceratto: site.pp: Set role for es2053 [puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) [11:41:17] (03CR) 10Marostegui: "Is the hieradata file generated too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:41:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:41:32] (03CR) 10Marostegui: [C:03+1] site.pp: Remove es2053 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196020 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:41:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:42:02] (03PS1) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [11:42:04] (03PS1) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [11:42:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7269/console" [puppet] - 10https://gerrit.wikimedia.org/r/1196019 (https://phabricator.wikimedia.org/T407208) (owner: 10Majavah) [11:43:26] (03PS1) 10Phuedx: ext.wikimediaEvents: simple-bot-detection: Use correct schema [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196024 [11:43:36] (03PS1) 10Phuedx: ext.wikimediaEvents: simple-bot-detection: Use correct schema [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196025 [11:44:48] (03PS1) 10Brouberol: airflow: set the default DAG parsing interval to 300s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196026 (https://phabricator.wikimedia.org/T406371) [11:44:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] 
(wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196025 (owner: 10Phuedx) [11:44:57] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [11:45:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196024 (owner: 10Phuedx) [11:45:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1247 gradually with 4 steps - Pool db1247.eqiad.wmnet in after cloning [11:46:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1093.eqiad.wmnet [11:46:36] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1094.eqiad.wmnet [11:47:27] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195664 (owner: 10PipelineBot) [11:47:45] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Log all requests [puppet] - 10https://gerrit.wikimedia.org/r/1195656 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [11:47:58] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194263 (owner: 10PipelineBot) [11:47:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [11:49:11] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195664 (owner: 10PipelineBot) [11:49:39] (03PS1) 10Brouberol: airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) [11:49:41] (03PS1) 10Brouberol: airflow-ml: enable the triggerer 
component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) [11:50:23] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195236 (owner: 10PipelineBot) [11:50:25] (03CR) 10Federico Ceratto: "Yes, prepared before in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194644" [puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:50:40] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Remove es2053 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196020 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:51:07] (03CR) 10Marostegui: [C:03+1] site.pp: Set role for es2053 [puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:51:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [11:51:57] (03CR) 10Brouberol: [C:03+2] growthbook: remove sextant dependency files from the root chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195979 (https://phabricator.wikimedia.org/T406786) (owner: 10Brouberol) [11:52:29] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195236 (owner: 10PipelineBot) [11:52:52] (03Merged) 10jenkins-bot: filebackend: Remove consistency check for multi-backend [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196018 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [11:54:40] (03CR) 10Btullis: airflow-main: increase max_map_length from 1024 to 1200 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [11:54:46] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1196018|filebackend: Remove 
consistency check for multi-backend (T328872)]] [11:54:50] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [11:54:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [11:55:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11271561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [11:55:35] (03CR) 10Btullis: [C:03+1] airflow: set the default DAG parsing interval to 300s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196026 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [11:55:40] (03PS2) 10Federico Ceratto: site.pp: Set role for es2053 [puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) [11:55:40] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2054 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1196030 (https://phabricator.wikimedia.org/T402859) [11:55:42] (03PS1) 10Federico Ceratto: site.pp: Remove es2054 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196031 (https://phabricator.wikimedia.org/T402859) [11:55:44] (03PS1) 10Federico Ceratto: site.pp: Set role for es2054 [puppet] - 10https://gerrit.wikimedia.org/r/1196032 (https://phabricator.wikimedia.org/T402859) [11:55:46] (03PS1) 10Federico Ceratto: es2054.yaml: Prepare es2054 for es2 [puppet] - 10https://gerrit.wikimedia.org/r/1196033 (https://phabricator.wikimedia.org/T402859) [11:55:51] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Include host header in access log [puppet] - 10https://gerrit.wikimedia.org/r/1195657 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [11:56:16] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: 
Include host header in access log [puppet] - 10https://gerrit.wikimedia.org/r/1195657 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [11:58:46] (03CR) 10Marostegui: [C:03+1] "It is easier to merge everything in the same patch eg: all es2054 together" [puppet] - 10https://gerrit.wikimedia.org/r/1196033 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:59:00] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1196018|filebackend: Remove consistency check for multi-backend (T328872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:59:30] (03CR) 10Marostegui: "I assume it is no longer in insetup mode?" [puppet] - 10https://gerrit.wikimedia.org/r/1196032 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:59:31] (03CR) 10Federico Ceratto: [C:03+2] es2054.yaml: Prepare es2054 for es2 [puppet] - 10https://gerrit.wikimedia.org/r/1196033 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:59:42] (03CR) 10Marostegui: [C:03+1] preseed.yaml: Remove es2054 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1196030 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1200) [12:00:07] (03CR) 10Marostegui: [C:03+1] site.pp: Remove es2054 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196031 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:00:49] (03CR) 10Marostegui: [C:03+1] "Next time send everything together, it is easier that way to review the whole change and it is not that big anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1196032 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:02:13] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Remove es2054 from insetup [puppet] - 
10https://gerrit.wikimedia.org/r/1196031 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:02:18] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2054 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1196030 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:02:31] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Set role for es2054 [puppet] - 10https://gerrit.wikimedia.org/r/1196032 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:02:52] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Set role for es2053 [puppet] - 10https://gerrit.wikimedia.org/r/1196021 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:03:08] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:03:22] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:03:36] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:04:37] (03CR) 10Btullis: airflow: allow the deployment of the triggerer component (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [12:04:56] (03CR) 10Btullis: [C:03+1] airflow-ml: enable the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [12:05:08] ph [12:05:16] sorry, wrong window [12:06:43] PROBLEM - Host ms-be1094 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:02] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be1094.eqiad.wmnet [12:07:14] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:07:32] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196018|filebackend: Remove consistency check for multi-backend (T328872)]] (duration: 12m 46s) [12:07:36] T328872: Commons: 
UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [12:08:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [12:08:39] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye [12:09:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11271651 (10Ladsgroup) [12:09:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11271655 (10thcipriani) Sorry for the delay, use-case makes sense to me. Approved! --- @Volker_E note for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|SpiderPig access]] you'll also n... [12:09:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11271657 (10thcipriani) a:05thcipriani→03None [12:12:10] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:12:34] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:12:42] (03PS1) 10Hnowlan: rest-gateway: add support for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196044 (https://phabricator.wikimedia.org/T406324) [12:13:20] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:13:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [12:15:15] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:15:39] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:16:38] (03PS2) 10Neslihan Turan: Add feature flag for pilot wikis about visual changes coming from 
Wikibase having an icon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [12:16:47] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:16:56] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11271694 (10TheDJ) One thing I have noticed, is that generating new static maps with shape data (from OSM) seems to take like 8'ish seconds. That feels so long, I'm pretty sure that t... [12:17:11] (03CR) 10Dr0ptp4kt: "Looks like these rules are recording, as evidenced on thanos.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [12:17:22] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:17:50] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:18:20] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:18:26] (03PS1) 10Clément Goubert: trafficserver: Default rest.php routing to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196045 (https://phabricator.wikimedia.org/T406318) [12:18:28] (03PS1) 10Clément Goubert: trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406324) [12:18:32] (03CR) 10Neslihan Turan: [C:03+1] Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [12:21:23] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11271717 (10elukey) >>! 
In T381565#11271694, @TheDJ wrote: > One thing I have noticed (after playing after the switch), is that generating new static maps with shape data (from OSM) s... [12:21:56] (03PS1) 10Clément Goubert: rest-gateway: Add routes for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196047 (https://phabricator.wikimedia.org/T406324) [12:26:50] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11271734 (10TheDJ) For this page it was: ` {"properties":{"title":"Wetter... [12:29:21] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11271751 (10elukey) All right perfect, this is great. The mapframe is surely translating the config to an HTTP call to kartotherian, if we figure out which one I'll be able to compare... [12:30:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage [12:30:25] (03PS1) 10Hashar: (DO NOT SUBMIT) Demo to disable Antoine account [puppet] - 10https://gerrit.wikimedia.org/r/1196050 [12:30:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1247 gradually with 4 steps - Pool db1247.eqiad.wmnet in after cloning [12:30:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2083.codfw.wmnet with OS bullseye [12:30:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1247.eqiad.wmnet onto db1260.eqiad.wmnet [12:30:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11271757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye complete... 
[12:31:11] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) Demo to disable Antoine account [puppet] - 10https://gerrit.wikimedia.org/r/1196050 (owner: 10Hashar) [12:32:18] (03PS1) 10Clément Goubert: admin: Add cgoubert SSH-FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1196052 [12:32:43] (03CR) 10Clément Goubert: "Adding @Ladsgroup@gmail.com as clinic duty SRE" [puppet] - 10https://gerrit.wikimedia.org/r/1196052 (owner: 10Clément Goubert) [12:33:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage [12:33:29] (03PS2) 10Hashar: (DO NOT SUBMIT) Demo to disable Antoine account [puppet] - 10https://gerrit.wikimedia.org/r/1196050 [12:34:14] (03CR) 10CI reject: [V:04-1] (DO NOT SUBMIT) Demo to disable Antoine account [puppet] - 10https://gerrit.wikimedia.org/r/1196050 (owner: 10Hashar) [12:34:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2084.codfw.wmnet with OS bullseye [12:34:40] (03Abandoned) 10Hashar: (DO NOT SUBMIT) Demo to disable Antoine account [puppet] - 10https://gerrit.wikimedia.org/r/1196050 (owner: 10Hashar) [12:34:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11271791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2084.codfw.wmnet with OS bullseye [12:36:42] (03PS2) 10Clément Goubert: trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406324) [12:37:37] (03Abandoned) 10Clément Goubert: trafficserver: Default rest.php routing to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196045 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [12:38:20] (03CR) 10Clément Goubert: [C:03+1] trafficserver: remove gateway-check group-specific routes for rest.php [puppet] - 
10https://gerrit.wikimedia.org/r/1195679 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [12:38:58] (03CR) 10Brouberol: airflow: allow the deployment of the triggerer component (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [12:39:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2032.codfw.wmnet onto es2053.codfw.wmnet [12:39:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2032 - Depool es2032.codfw.wmnet to then clone it to es2053.codfw.wmnet - fceratto@cumin1002 [12:39:46] (03CR) 10Brouberol: airflow-main: increase max_map_length from 1024 to 1200 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [12:39:53] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Depool es2032.codfw.wmnet to then clone it to es2053.codfw.wmnet - fceratto@cumin1002 [12:42:11] (03PS1) 10Arnaudb: gerrit: typo fix in post_sync_validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1196051 (https://phabricator.wikimedia.org/T387833) [12:42:11] (03CR) 10Arnaudb: "a small issue that was no blocker for the switchover, still needs to be fixed" [cookbooks] - 10https://gerrit.wikimedia.org/r/1196051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:42:54] fceratto@cumin1002 clone_es (PID 3887543) is awaiting input [12:43:04] (03PS2) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [12:43:04] (03PS2) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [12:43:04] (03PS1) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 
10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [12:43:59] (03PS3) 10Seanleong-wmde: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) [12:44:03] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11271851 (10elukey) Wow really great! I added this [[ https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-3h&to=now&timezone=utc&... [12:45:30] (03PS2) 10Brouberol: airflow: set the default DAG parsing interval to 300s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196026 (https://phabricator.wikimedia.org/T407191) [12:45:59] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:46:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2084.codfw.wmnet with reason: host reimage [12:46:42] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:47:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:47:12] (03PS3) 10Clément Goubert: trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406324) [12:47:38] (03PS2) 10Brouberol: airflow-main: increase max_map_length from 1024 to 1200 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) [12:47:39] (03CR) 10Brouberol: airflow-main: increase max_map_length from 1024 to 1200 (031 
comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [12:48:04] (03PS3) 10Brouberol: airflow: increase max_map_length from 1024 to 1200 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) [12:48:14] (03CR) 10Hnowlan: [C:04-1] api-gateway: Add rate limiting for REST gateway (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [12:49:04] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:50:13] (03CR) 10Ladsgroup: "I need to confirm the identity out of band. Give me a bit." [puppet] - 10https://gerrit.wikimedia.org/r/1196052 (owner: 10Clément Goubert) [12:50:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2084.codfw.wmnet with reason: host reimage [12:51:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye [12:51:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [12:51:45] (03CR) 10Clément Goubert: "https://www.montanabrandtools.com/cdn/shop/products/MB_65020-2in_P2_Driver_Bit-WEB_1000x.jpg" [puppet] - 10https://gerrit.wikimedia.org/r/1196052 (owner: 10Clément Goubert) [12:51:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [12:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:53:13] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1095.eqiad.wmnet [12:55:34] (03PS1) 10Vgutierrez: admin: Add FIDO key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1196060 [12:56:04] (03PS2) 10Brouberol: airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) [12:56:05] (03PS2) 10Brouberol: airflow-ml: enable the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) [12:57:22] jouncebot: nowandnext [12:57:22] For the next 0 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1200) [12:57:22] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1300) [12:57:43] (03CR) 10Brouberol: airflow: allow the deployment of the triggerer component (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [12:58:27] (03PS1) 10Samtar: ext.wikimediaEvents.WatchlistBaseline: Send source/instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196061 (https://phabricator.wikimedia.org/T401575) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1300). [13:00:05] duesen and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:32] o/ [13:00:50] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1095.eqiad.wmnet [13:01:02] hashar: for the security patch, do I have to deploy for the two branches separately? or can they be scapped together? [13:01:19] I guess the script only does one at a time... [13:01:51] o/ [13:02:13] duesen: I thought the script deployed to all branches? but it’s been a few months since I used it [13:02:37] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [13:02:46] Lucas_WMDE: deploy_security.py? [13:02:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196061 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [13:02:58] I thought so, yes [13:03:05] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment:_via_script also sounds like it [13:03:11] are right, the docs say "You can run it on one branch only if you want." [13:03:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-wikidata: apply [13:03:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-wikidata: apply [13:03:51] ok... I haven't used the script before... 
I'll just follow the instructions and hope for the best :) [13:04:11] do the dry run first anyway :) [13:04:33] yea... can I just go ahead? [13:04:49] "just follow the instructions and hope for the best" sounds a little quip-worthy /j [13:04:54] go ahead :) [13:05:15] TheresNoTime: *you* sound a little quip-worthy /j [13:05:23] :P [13:05:28] it has been known to happen :D [13:05:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2084.codfw.wmnet with OS bullseye [13:05:35] whomst would ever do such a thing [13:05:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11271985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2084.codfw.wmnet with OS bullseye complete... [13:05:52] (when all the deployments are over, could someone ping me - i have one to do) [13:06:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:07:10] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:07:11] hm... the dry run spits out a bunch of lines and exits very fast [13:07:18] not sure whether that's good or bad [13:07:23] yeah, it’s just printing the commands it would run [13:07:34] exit code is 0 [13:07:41] i guess i'll -run it then [13:08:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [13:08:19] (03PS4) 10Clément Goubert: trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406324) [13:08:54] o/ I'm here.
Sorry I'm late [13:09:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [13:09:15] running /srv/mediawiki-staging --- scap sync-file [13:09:34] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Add routes for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196047 (https://phabricator.wikimedia.org/T406324) (owner: 10Clément Goubert) [13:09:36] phuedx: hi! duesen is deploying atm [13:09:39] (03Abandoned) 10Hnowlan: rest-gateway: add support for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196044 (https://phabricator.wikimedia.org/T406324) (owner: 10Hnowlan) [13:09:46] do you want to self-service afterwards or do you want a deployer? :) [13:09:51] (03PS5) 10Clément Goubert: trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406599) [13:10:15] Lucas_WMDE: I should be able to self-service. Thanks for offering [13:10:19] ok! [13:10:34] Lucas_WMDE: Though I do want to ask: I should be able to do both at the same time, right? [13:10:35] by the way, is there a magic remedy for idle ssh connections freezing? I keep running into this issue. [13:10:43] (03CR) 10Hnowlan: [C:03+1] trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406599) (owner: 10Clément Goubert) [13:10:49] phuedx: yeah that should be fine [13:10:53] (03PS2) 10Stevemunene: airflow-wikidata: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) [13:11:17] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11272023 (10Ladsgroup) L3 hasn't been signed yet. 
[13:11:31] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [13:12:23] duesen: Do you have ServerAliveInterval and ServerAliveCountMax setup in your ssh config? [13:12:32] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228 (10APizzata-WMF) 03NEW [13:12:55] (03PS1) 10DCausse: cirrus: prepare completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) [13:13:55] claime no. should I? [13:14:14] (03PS2) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [13:14:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [13:14:23] RECOVERY - Host ms-be1081 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:14:24] scap sync-file has been sitting with no output for 6 minutes now... [13:14:25] duesen: I have them 120 for interval and 30 for count, that makes the client send server alive messages (disabled by default in debian), so the client disconnect after an hour with the server not responding. 
That usually helps if you have an unstable-ish connection that just hangs; you can set that much shorter so it will disconnect instead of staying hung [13:14:31] (03CR) 10CI reject: [V:04-1] cirrus: prepare completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [13:14:46] TheresNoTime, duesen: https://bash.toolforge.org/quip/x9Pb4pkB8tZ8Ohr0Z6I4 [13:15:03] (03CR) 10CI reject: [V:04-1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [13:15:06] * Lucas_WMDE gets empty output from ssh -G deployment.eqiad.wmnet | grep ServerAlive o_O [13:15:25] (03CR) 10Xcollazo: airflow: increase max_map_length from 1024 to 1200 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [13:15:26] xSavitar: heh [13:15:33] xSavitar: :P [13:16:07] Lucas_WMDE: grep -i [13:16:11] it prints everything lowercased [13:16:27] !log daniel Deployed security patch for T405859 [13:16:31] bah [13:16:34] (03CR) 10Vgutierrez: [C:03+1] airflow-wikidata: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [13:16:42] thanks, you’re right.
I have countmax 3, interval 0 [13:16:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [13:16:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [13:16:51] (03CR) 10Brouberol: [C:03+2] airflow-wikidata: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [13:16:55] guess that’s the arch (bsd?) defaults instead of the debian defaults and that might be why I haven’t noticed the problem [13:16:58] duesen: also if you didn't know, to kill a hung ssh session, press Enter, then ~. [13:17:13] grr... i got '1.45.0-wmf.23/core/16-T405859.patch' has mode 644 but it should be 664.
Please chmod 664 1.45.0-wmf.23/core/16-T405859.patch [13:17:21] ok, i'll chmod and try again [13:17:26] o_O [13:17:26] Lucas_WMDE: interval 0 means it's disabled and only sending tcpkeepalive [13:17:38] ok [13:18:28] I have a Host * directive at the top of my ssh_config with ServerAliveInterval 120 ServerAliveCountMax 30 [13:19:10] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-codfw [13:20:02] (03PS3) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [13:20:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [13:21:18] (03CR) 10Brouberol: [C:03+2] airflow: set the default DAG parsing interval to 300s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196026 (https://phabricator.wikimedia.org/T407191) (owner: 10Brouberol) [13:22:14] (03CR) 10Brouberol: airflow: increase max_map_length from 1024 to 1200 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [13:23:57] (03CR) 10Xcollazo: [C:03+1] airflow: increase max_map_length from 1024 to 1200 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [13:24:22] FIRING: [6x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:49] (03CR) 10Brouberol: [C:03+2] airflow: increase max_map_length from 1024 to 1200 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195985 (https://phabricator.wikimedia.org/T406371) (owner: 10Brouberol) [13:25:01] (03CR) 10Majavah: [C:03+1] "Looking at the manpage, ifquery will exit 0 if the interface config exists, 
which seems to match what we want to do here." [puppet] - 10https://gerrit.wikimedia.org/r/1195192 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:25:35] duesen: How's it going? [13:25:53] phuedx: slowly... [13:26:01] !log daniel Deployed security patch for T405859 [13:26:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:26:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key also validated out-of-band" [puppet] - 10https://gerrit.wikimedia.org/r/1196060 (owner: 10Vgutierrez) [13:26:26] the deployment script for security patches takes forever. [13:26:31] (03CR) 10Majavah: "can you do a PCC for this?" [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:26:32] (03CR) 10Vgutierrez: [C:03+2] admin: Add FIDO key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1196060 (owner: 10Vgutierrez) [13:26:33] ...and now it failed. in a dirty state. gah. 
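Aside on the ssh keepalive exchange above (13:12 to 13:18): the "disconnect after an hour" figure follows from multiplying the two settings, and an interval of 0 means no server-alive messages are sent at all. A minimal sketch of that arithmetic, assuming the OpenSSH client semantics described in the chat; the function names here are hypothetical and exist only for illustration.

```python
from typing import Optional


def keepalive_timeout_seconds(interval: int, count_max: int = 3) -> Optional[int]:
    """Approximate seconds of server silence before the ssh client gives up.

    Returns None when interval is 0, which disables server-alive messages.
    The default count_max of 3 mirrors the value reported in the chat.
    """
    if interval < 0 or count_max < 0:
        raise ValueError("interval and count_max must be non-negative")
    if interval == 0:
        return None
    return interval * count_max


def ssh_config_stanza(host_pattern: str, interval: int, count_max: int) -> str:
    """Render the settings as an ssh_config block (illustrative helper)."""
    return (
        f"Host {host_pattern}\n"
        f"    ServerAliveInterval {interval}\n"
        f"    ServerAliveCountMax {count_max}\n"
    )


if __name__ == "__main__":
    # Values quoted in the chat: interval 120, count 30.
    print(keepalive_timeout_seconds(120, 30))
    # Values Lucas_WMDE reported: interval 0, count 3 (keepalive disabled).
    print(keepalive_timeout_seconds(0, 3))
    print(ssh_config_stanza("*", 120, 30))
```

With 120 and 30 the product is 3600 seconds; shorter values make a hung session drop sooner, as suggested in the chat.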
[13:26:45] D: [13:26:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:26:53] FIRING: [6x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:56] error: patch failed: includes/import/ImportableOldRevisionImporter.php:66 [13:26:56] error: includes/import/ImportableOldRevisionImporter.php: patch does not apply [13:27:05] Lucas_WMDE: help :D [13:27:47] *looks* [13:27:48] oh no [13:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:10] hm, it looks like it went in for .22 but not for .23 [13:28:10] was this in the wmf.23 or wmf.22 part? [13:28:42] for wmf.23 [13:29:22] RESOLVED: [6x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:34] odd, i did check locally... [13:29:43] weird, I don’t see the file being touched lately… [13:29:44] but maybe a backport changed the branch? [13:30:00] maybe it's because of the failed patch earlier? can it conflict with itself? 
[13:30:03] not AFAICT [13:30:08] (03CR) 10Andrew Bogott: [C:03+2] reprepro: add trixie component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1195773 (owner: 10Andrew Bogott) [13:30:09] * Lucas_WMDE looks at git on the deployment server [13:30:18] (03PS2) 10Andrew Bogott: reprepro: add trixie component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1195773 [13:30:53] I think that must be it yeah [13:31:00] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw [13:31:00] it’s applied in both wmf.23 and wmf.22 AFAICT [13:31:12] according to git, at least. idk if it was also synced everywhere [13:31:25] so I guess trying to apply it again fails [13:31:42] so… copy it to wmf.23 in /srv/patches and that should be it? [13:31:56] and then the next scap for the backport window (by phuedx) will make sure it’s definitely applied everywhere [13:33:10] (03CR) 10Andrew Bogott: [C:03+2] reprepro: add trixie component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1195773 (owner: 10Andrew Bogott) [13:33:36] Lucas_WMDE: it's already there. [13:33:55] ah, i see [13:33:58] the commit message confused me then [13:34:04] so it's merged into both branches and it's in /srv/patches on the deploy host. Also, scap sync-file ran successfully. [13:34:11] yeah, I think you’re done then [13:34:13] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7001*} or P{cp4037*} and A:cp [13:34:22] FIRING: [12x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:34:45] Lucas_WMDE: ok nice. Thanks for your help! 
[13:34:50] np [13:34:53] PROBLEM - Host ms-be1081 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:58] phuedx: over to you! [13:35:01] phuedx: i'm done, go ahead [13:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:35:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:35:17] (03CR) 10Neslihan Turan: [C:03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [13:36:05] Lucas_WMDE, duesen: I suppose congratulations are in order :D [13:36:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org [13:36:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:36:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196025 (owner: 10Phuedx) [13:36:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196024 (owner: 10Phuedx) [13:37:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:38:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:38:51] RECOVERY - Host ms-be1081 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms 
[13:39:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:39:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:40:00] (03PS1) 10Andrew Bogott: distributions-wikimedia trixie: fix sorting [puppet] - 10https://gerrit.wikimedia.org/r/1196070 [13:40:15] (03PS2) 10DCausse: cirrus: prepare completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) [13:40:20] Lucas_WMDE: I think the script should check file permissions (or just chmod) before running scap... I'm still a little confused about *what* refused to apply the fix, and why the failure happened *after* scap... [13:40:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org [13:41:15] PROBLEM - Host ms-be1081 is DOWN: PING CRITICAL - Packet loss = 100% [13:41:32] ping Amir1 who wrote deploy_security.py ^ [13:42:15] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:44:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org [13:44:40] (03PS1) 10Brouberol: airflow: add the -ops prefix to the Op LDAP group name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196074 (https://phabricator.wikimedia.org/T407238) [13:44:57] (03PS2) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) [13:45:57] I wrote it but I don't own it [13:46:10] if you're not comfortable, use the manual process [13:46:37] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7001.magru.wmnet [13:46:39] !log elukey@cumin2002 START - Cookbook
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:47:51] RECOVERY - Host ms-be1081 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [13:48:31] (03CR) 10Majavah: [C:03+1] distributions-wikimedia trixie: fix sorting [puppet] - 10https://gerrit.wikimedia.org/r/1196070 (owner: 10Andrew Bogott) [13:48:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org [13:49:14] (03Merged) 10jenkins-bot: ext.wikimediaEvents: simple-bot-detection: Use correct schema [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196025 (owner: 10Phuedx) [13:49:14] (03Merged) 10jenkins-bot: ext.wikimediaEvents: simple-bot-detection: Use correct schema [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196024 (owner: 10Phuedx) [13:49:22] (03CR) 10Andrew Bogott: [C:03+2] distributions-wikimedia trixie: fix sorting [puppet] - 10https://gerrit.wikimedia.org/r/1196070 (owner: 10Andrew Bogott) [13:49:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11272354 (10MatthewVernon) [13:49:33] oh wow, that took a while to merge [13:49:48] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1196025|ext.wikimediaEvents: simple-bot-detection: Use correct schema]], [[gerrit:1196024|ext.wikimediaEvents: simple-bot-detection: Use correct schema]] [13:49:55] TheresNoTime: you should probably rebase and then +2 your backport right away, so it’s ready to deploy once phuedx is done [13:50:14] Lucas_WMDE: ack [13:50:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11272356 (10MatthewVernon) Hi @Jhancock.wm ms-be2083 looks great, thank you. I'm afraid ms-be2084 isn't happy - it thinks it has 0 physical disks attache... 
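Aside on the file-mode failure discussed above (the "has mode 644 but it should be 664" error at 13:17, and the 13:40 suggestion that the script check permissions before running scap): a pre-flight check of this kind could look roughly like the sketch below. This is not the logic of deploy_security.py, which is not shown in the log; the function names are hypothetical and the 0o664 default is taken from the error message quoted in the chat.

```python
import os
import stat
from typing import Optional


def mode_mismatch(path: str, expected: int = 0o664) -> Optional[int]:
    """Return the file's actual permission bits if they differ from expected.

    Returns None when the mode already matches.
    """
    actual = stat.S_IMODE(os.stat(path).st_mode)
    return None if actual == expected else actual


def ensure_mode(path: str, expected: int = 0o664) -> bool:
    """chmod the file to the expected mode if needed.

    Returns True when a change was made, False when the mode was already right.
    """
    if mode_mismatch(path, expected) is None:
        return False
    os.chmod(path, expected)
    return True


if __name__ == "__main__":
    import tempfile

    # Reproduce the situation from the log on a throwaway file.
    with tempfile.NamedTemporaryFile(suffix=".patch", delete=False) as handle:
        patch_path = handle.name
    os.chmod(patch_path, 0o644)
    wrong = mode_mismatch(patch_path)
    if wrong is not None:
        print(f"{patch_path} has mode {wrong:o} but it should be 664")
    ensure_mode(patch_path)
    os.remove(patch_path)
```

Running such a check before the sync step would surface the mode problem up front rather than partway through a deployment; whether to fail loudly or chmod silently is a design choice the chat leaves open.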
[13:51:14] (03CR) 10Samtar: [C:03+2] "deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196061 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [13:51:15] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1196078 [13:52:54] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:53:20] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:53:37] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [13:53:37] …or +2 without rebase and let Gerrit do it, I guess ^^ [13:53:47] (03PS1) 10Eevans: admin: add FIDO key for eevans [puppet] - 10https://gerrit.wikimedia.org/r/1196079 [13:54:12] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1196025|ext.wikimediaEvents: simple-bot-detection: Use correct schema]], [[gerrit:1196024|ext.wikimediaEvents: simple-bot-detection: Use correct schema]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:54:20] Looking [13:54:50] (03CR) 10Btullis: [C:03+1] airflow: add the -ops prefix to the Op LDAP group name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196074 (https://phabricator.wikimedia.org/T407238) (owner: 10Brouberol) [13:55:12] (03CR) 10Brouberol: [C:03+2] airflow: add the -ops prefix to the Op LDAP group name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196074 (https://phabricator.wikimedia.org/T407238) (owner: 10Brouberol) [13:55:31] (03CR) 10Bking: [C:03+2] opensearch on k8s: add service definitions [puppet] - 10https://gerrit.wikimedia.org/r/1195342 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [13:55:38] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[13:56:00] LGTM. I got the instrument to submit events to the correct stream with the correct schema [13:56:04] !log phuedx@deploy2002 phuedx: Continuing with sync [13:56:10] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:56:35] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [13:56:37] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [13:57:24] (03PS3) 10Brouberol: airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) [13:57:24] (03PS3) 10Brouberol: airflow-ml: enable the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) [13:58:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:58:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:58:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1081 reports no disks - controller failure? - https://phabricator.wikimedia.org/T407198#11272368 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Confirmed the issue, which was noticed in the BIOS. Verified internal connections and performed... 
[13:59:08] (03CR) 10Btullis: [C:03+1] airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [13:59:16] (03CR) 10Elukey: [C:04-1] "still not ready :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1400) [14:00:05] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196025|ext.wikimediaEvents: simple-bot-detection: Use correct schema]], [[gerrit:1196024|ext.wikimediaEvents: simple-bot-detection: Use correct schema]] (duration: 10m 17s) [14:00:09] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1002" [14:00:23] TheresNoTime: Over to you [14:00:29] phuedx: many thanks [14:00:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1002" [14:00:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:30] !log fceratto@cumin1002 START - Cookbook sre.dns.wipe-cache db-test1001.eqiad.wmnet on all recursors [14:00:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1001.eqiad.wmnet on all recursors [14:01:00] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1002" [14:01:06] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1002" [14:01:23] (03Merged) 10jenkins-bot: ext.wikimediaEvents.WatchlistBaseline: Send source/instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196061 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [14:02:08] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db-test1001.eqiad.wmnet with OS trixie [14:02:21] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1196061|ext.wikimediaEvents.WatchlistBaseline: Send source/instrument (T401575)]] [14:02:24] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [14:03:34] (03CR) 10CDanis: [C:03+2] haproxylua: add core.concat() reimpl [puppet] - 10https://gerrit.wikimedia.org/r/1195228 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [14:03:48] (03PS1) 10Andrew Bogott: designate_sink: use shlex.quote() rather than the now-obsolete pipes.quote() [puppet] - 10https://gerrit.wikimedia.org/r/1196084 (https://phabricator.wikimedia.org/T406516) [14:04:13] (03PS3) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [14:04:14] (03PS3) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [14:04:14] (03PS2) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [14:05:02] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:05:33] (03CR) 10CDanis: [C:03+2] ja3n: use core.concat() [puppet] - 
10https://gerrit.wikimedia.org/r/1195231 (owner: 10CDanis) [14:05:39] (03CR) 10Andrew Bogott: [C:03+2] designate_sink: use shlex.quote() rather than the now-obsolete pipes.quote() [puppet] - 10https://gerrit.wikimedia.org/r/1196084 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [14:06:12] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) (owner: 10Muehlenhoff) [14:06:21] (03PS1) 10Jcrespo: admin: Replaced yubico key with one with a key handle stored on disk [puppet] - 10https://gerrit.wikimedia.org/r/1196085 [14:06:34] !log samtar@deploy2002 samtar: Backport for [[gerrit:1196061|ext.wikimediaEvents.WatchlistBaseline: Send source/instrument (T401575)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:06:38] * TheresNoTime looking ^ [14:06:39] (03PS1) 10Brouberol: airflow: avoid repeating analytics twice in the principal name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196086 (https://phabricator.wikimedia.org/T407238) [14:06:43] (03PS2) 10Jcrespo: admin: Replace yubikey with one with a key handle stored on disk [puppet] - 10https://gerrit.wikimedia.org/r/1196085 [14:07:19] (03CR) 10CDanis: [C:03+1] Failover failoid in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) (owner: 10Muehlenhoff) [14:07:46] (03PS4) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [14:07:46] (03PS4) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [14:07:46] (03PS3) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 
(https://phabricator.wikimedia.org/T407199) [14:07:47] !log samtar@deploy2002 samtar: Continuing with sync [14:08:47] (03PS3) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) [14:09:06] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [14:09:30] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [14:09:42] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2055.codfw.wmnet'] [14:09:59] (03CR) 10Majavah: Add missing Cumin alias for cloudrabbit/codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193836 (owner: 10Muehlenhoff) [14:11:03] (03CR) 10Brouberol: [C:03+2] airflow: avoid repeating analytics twice in the principal name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196086 (https://phabricator.wikimedia.org/T407238) (owner: 10Brouberol) [14:11:38] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:11:46] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196061|ext.wikimediaEvents.WatchlistBaseline: Send source/instrument (T401575)]] (duration: 09m 25s) [14:11:50] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [14:12:27] * TheresNoTime is done [14:12:31] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage [14:12:44] here's hoping that's the instrumentation *done* now! 
(☞゚ヮ゚)☞ [14:13:32] (03PS1) 10Brouberol: airflow-analytics-test: fix role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196088 (https://phabricator.wikimedia.org/T407238) [14:14:47] !log UTC afternoon backport+config window done [14:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:01] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: fix role mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196088 (https://phabricator.wikimedia.org/T407238) (owner: 10Brouberol) [14:16:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, key also validated out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1196079 (owner: 10Eevans) [14:16:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:14] (03CR) 10Eevans: [C:03+2] admin: add FIDO key for eevans [puppet] - 10https://gerrit.wikimedia.org/r/1196079 (owner: 10Eevans) [14:17:15] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2055.codfw.wmnet'] [14:17:35] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11272464 (10Lars) [14:17:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:17:56] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11272467 (10Lars) @Ladsgroup Have now signed. 
[14:18:25] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-codfw [14:18:56] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage [14:19:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:19:15] nb. seeing a lot (582+) of `PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::getImages was deprecated in MediaWiki 1.43` since 14:10 UTC - see T407240 [14:19:16] T407240: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::getImages was deprecated in MediaWiki 1.43. [Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T407240 [14:20:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:20:12] which coincides with my backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1196061 finishing, but that change is unrelated.. 
[14:21:06] (03CR) 10Hnowlan: [C:03+1] api-gateway: Add support for PHP_ENGINE cookie routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:21:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:21:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:34] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:21:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:22:41] (03PS1) 10Cathal Mooney: sudoers: allow members of datacenter-ops group run homer [puppet] - 10https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) [14:22:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:23:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [14:24:00] jouncebot: nowandnext [14:24:00] For the next 0 hour(s) and 5 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1400) [14:24:00] In 0 hour(s) and 5 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1430) [14:25:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:25:38] (03PS4) 10Seanleong-wmde: Add feature 
flag for pilot wikis about visual changes coming from Wikibase having an icon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) [14:25:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:25:51] (03CR) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:26:05] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-eqiad [14:26:21] (03PS5) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [14:26:21] (03PS5) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [14:26:21] (03PS4) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [14:26:25] (03CR) 10Muehlenhoff: [C:03+1] "The patch looks good syntax and this also has my +1, but this will need meeting approval in the next IF meeting as it changes a sudo role." [puppet] - 10https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [14:26:53] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:26:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:27:21] (fwiw, T407240 was a dupe and looks to have stopped now anyway!) 
[14:27:21] T407240: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::getImages was deprecated in MediaWiki 1.43. [Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T407240 [14:27:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:28:19] (03CR) 10CI reject: [V:04-1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:28:28] (03CR) 10Clément Goubert: [C:03+1] mw-(api-ext|web): Scale next releases to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194716 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:29:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:29:34] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:30:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1430) [14:30:05] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4037.ulsfo.wmnet [14:30:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7001*} or P{cp4037*} and A:cp [14:30:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key has also been validated via an out-of-band channel" [puppet] - 10https://gerrit.wikimedia.org/r/1196052 (owner: 10Clément Goubert) [14:30:26] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - 
https://phabricator.wikimedia.org/T396584#11272531 (10elukey) We are still dropping old buckets, it takes a really long time but I have a tmux session on thanos-... [14:30:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:31:09] (03CR) 10Clément Goubert: [C:03+1] Enroll 1% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194718 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:31:20] (03PS3) 10Majavah: hieradata: disable agent forwarding in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) [14:31:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:32:11] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1001.eqiad.wmnet with OS trixie [14:32:11] !log fceratto@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1001.eqiad.wmnet [14:32:33] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:32:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:33:21] (03CR) 10Majavah: [C:03+2] hieradata: disable agent forwarding in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [14:33:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:34:00] (03CR) 10Cathal Mooney: "LGTM. 
In theory you can use these sysctl's to make sure the route doesn't get used if the interface is down, but I've seen Linux not do w" [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:34:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [14:34:46] (03CR) 10Cathal Mooney: [C:03+1] "Yeah all good makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:35:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:35:23] (03CR) 10Cathal Mooney: "LGTM, as taavi said PCC output would be good to be doubly sure, but seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:35:50] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:35:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:36:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:36:47] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11272588 (10Ladsgroup) [14:39:06] (03CR) 10Clément Goubert: [C:03+2] admin: Add cgoubert SSH-FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1196052 (owner: 10Clément Goubert) [14:41:30] (03PS4) 
10Brouberol: airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) [14:41:30] (03PS4) 10Brouberol: airflow-ml: enable the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) [14:42:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:35] (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM, let's get taavi's input too as he is quite familiar with the setup." [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:43:59] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195192 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:44:47] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:25] (03CR) 10CDanis: [C:03+2] haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [14:46:59] (03CR) 10Majavah: cloudceph: handle double / single NIC transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:48:13] (03PS6) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 
(https://phabricator.wikimedia.org/T407199) [14:48:13] (03PS6) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [14:48:13] (03PS5) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [14:49:04] (03CR) 10Clément Goubert: [C:03+1] trafficserver: enable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:50:24] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11272641 (10elukey) @Mvolz I think that the extra traffic registered by the mesh is for `/_info`, because the envoy metrics don't contain any way... [14:50:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:51:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:51:44] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:54:36] (03PS31) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:56:45] (03PS1) 10Joely Rooke WMDE: DEMO CHERRY PICK - DO NOT DEPLOY [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196097 (https://phabricator.wikimedia.org/T395674) [14:58:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:58:31] (03PS7) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [14:58:31] (03PS7) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [14:58:31] (03PS6) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [14:58:41] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T407244 [14:58:45] T407244: Deploy Phabricator/Phorge 2025-10-14 - https://phabricator.wikimedia.org/T407244 [15:00:05] jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1500). Please do the needful. [15:01:43] (03PS1) 10Ahmon Dancy: buildkitd: Bump buildkit image to wmf-v0.25.1 [puppet] - 10https://gerrit.wikimedia.org/r/1196099 (https://phabricator.wikimedia.org/T406772) [15:02:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11272671 (10VRiley-WMF) Hey @MatthewVernon Just wanted to check in and see if the other two maybe be ready? Let us now, thanks! 
[15:02:02] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [15:02:32] !log brennen@deploy2002 Started deploy [phabricator/deployment@16c9739]: deploy phab2002 for T407244 [15:03:03] !log brennen@deploy2002 Finished deploy [phabricator/deployment@16c9739]: deploy phab2002 for T407244 (duration: 00m 31s) [15:03:27] !log brennen@deploy2002 Started deploy [phabricator/deployment@16c9739]: deploy phab1004 for T407244 [15:04:26] !log brennen@deploy2002 Finished deploy [phabricator/deployment@16c9739]: deploy phab1004 for T407244 (duration: 00m 58s) [15:04:29] T407244: Deploy Phabricator/Phorge 2025-10-14 - https://phabricator.wikimedia.org/T407244 [15:05:25] (03Abandoned) 10Joely Rooke WMDE: DEMO CHERRY PICK - DO NOT DEPLOY [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196097 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE) [15:05:48] (03CR) 10MVernon: [C:03+1] Provision hosts aqs102[3-7] (refresh of aqs101[0-2,4-5]) [puppet] - 10https://gerrit.wikimedia.org/r/1195276 (https://phabricator.wikimedia.org/T407032) (owner: 10Eevans) [15:05:49] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test1003.eqiad.wmnet [15:05:50] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [15:08:27] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:09] (03CR) 10Brennen Bearnes: [C:03+1] "Will test in devtools prior to production deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:11:35] fceratto@cumin1002 makevm (PID 4156671) is awaiting input [15:12:45] (03CR) 10Eevans: [C:03+2] Provision hosts aqs102[3-7] (refresh of aqs101[0-2,4-5]) [puppet] - 10https://gerrit.wikimedia.org/r/1195276 (https://phabricator.wikimedia.org/T407032) (owner: 10Eevans) [15:17:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11272733 (10MatthewVernon) Hi @VRiley-WMF I'm afraid not (filesystems still about 25% full, so a little way to go yet). [15:17:07] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-eqiad [15:18:37] (03PS1) 10Neslihan Turan: Revert "Add icons for wikibase changes. WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196101 [15:20:10] !log installing jq security updates [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 25): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7272/c" [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [15:21:44] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1003.eqiad.wmnet - fceratto@cumin1002" [15:24:49] fceratto@cumin1002 makevm (PID 4156671) is awaiting input [15:27:42] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 27): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7274/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1195192 
(https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [15:27:47] (03PS1) 10Muehlenhoff: Add library hint for jq [puppet] - 10https://gerrit.wikimedia.org/r/1196107 [15:30:04] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1003.eqiad.wmnet - fceratto@cumin1002" [15:30:04] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:04] !log fceratto@cumin1002 START - Cookbook sre.dns.wipe-cache db-test1003.eqiad.wmnet on all recursors [15:30:07] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1003.eqiad.wmnet on all recursors [15:30:15] (03PS1) 10Zabe: BETA: Try using Hadoop QueryPage computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196109 (https://phabricator.wikimedia.org/T309738) [15:30:35] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1003.eqiad.wmnet - fceratto@cumin1002" [15:30:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, key also validated out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1196085 (owner: 10Jcrespo) [15:30:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1003.eqiad.wmnet - fceratto@cumin1002" [15:30:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11272796 (10elukey) >>! In T392851#11261019, @Jhancock.wm wrote: > @elukey license uploaded for cp2056. should be good to try that one again. Host is good now!... 
[15:31:10] (03PS1) 10Brouberol: airflow-main: reduce alotted CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196111 (https://phabricator.wikimedia.org/T407191) [15:31:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196101 (owner: 10Neslihan Turan) [15:33:46] fceratto@cumin1002 makevm (PID 4156671) is awaiting input [15:33:51] !log disable-puppet on A:cp hosts - T405955 [15:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:54] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [15:34:09] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:34:14] (03CR) 10Scott French: [C:03+2] trafficserver: enable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:38:27] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:24] (03CR) 10Dzahn: [C:03+2] buildkitd: Bump buildkit image to wmf-v0.25.1 [puppet] - 10https://gerrit.wikimedia.org/r/1196099 (https://phabricator.wikimedia.org/T406772) (owner: 10Ahmon Dancy) [15:43:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11272912 (10SKaram-WMF) [15:44:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - 
https://phabricator.wikimedia.org/T407094#11272916 (10SKaram-WMF) Hi, thank you for the suggestion. I've replaced it with the ed one. Thank you! [15:44:53] (03CR) 10Majavah: interface: add pre_down_command define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [15:46:01] (03CR) 10Dzahn: "I would say because knowing that current data is synced between hosts and can be restored to either of the hosts (without copying private " [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:46:25] !log rolling run-puppet-agent on A:cp hosts - T405955 [15:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:29] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [15:47:49] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db-test1003.eqiad.wmnet with OS trixie [15:48:26] (03PS1) 10Zabe: Using Hadoop for MostTranscludedPages on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) [15:48:54] (03CR) 10Zabe: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [15:50:02] (03CR) 10Majavah: Using Hadoop for MostTranscludedPages on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [15:50:12] !log contint2002 - rebooting - (not the manager host) [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:57] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on contint2002.wikimedia.org with reason: reboot [15:52:10] jouncebot: nowandnext [15:52:10] For the next 0 hour(s) and 7 minute(s): SRE Collaboration Services 
office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1500) [15:52:10] In 0 hour(s) and 7 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1600) [15:53:03] (03PS32) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [15:53:08] swfrench-wmf: both slots are empty ..if you wanted more time or start early [15:53:57] mutante: thanks! I won't be touching mediawiki for a bit, but wanted to see what else might be happening before I start testing some rest-gateway changes [15:54:14] ack. seems all quiet [15:54:36] just doing some reboots [15:54:39] (03CR) 10Zabe: [C:04-2] Using Hadoop for MostTranscludedPages on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [15:55:23] mutante: sounds good. thanks! 
[15:56:05] (03CR) 10Scott French: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:56:32] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 46 NOOP 43): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-nod" [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [15:56:40] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on contint1002.wikimedia.org with reason: reboot [15:57:24] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage [15:57:28] 10ops-esams, 06SRE, 06DC-Ops: esams: access switches fans blowing the wrong way - https://phabricator.wikimedia.org/T406734#11273028 (10cmooney) [15:57:39] !log rebooting main CI server - integration.wikimedia.org will be down for a minute [15:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:58] 10ops-esams, 06SRE, 06DC-Ops: esams: access switches fans blowing the wrong way - https://phabricator.wikimedia.org/T406734#11273029 (10RobH) Summary of issue: The switches in esams are racked with the ports facing the cold aisle, they should be racked with the ports facing the hot aisle. Solution: Offline... [15:59:18] (03PS1) 10Andrew Bogott: wmcs-annual-purge.py: update to catch up with various openstack changes [puppet] - 10https://gerrit.wikimedia.org/r/1196114 [15:59:42] (03CR) 10Scott French: [C:03+2] api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:00:05] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:13] !log CI should be back in operation as normal [16:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] jouncebot: nowandnext [16:03:47] For the next 0 hour(s) and 56 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1600) [16:03:48] In 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1700) [16:04:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage [16:05:39] (03CR) 10Hashar: [C:03+2] "CR+2 again cause the server hosting Zuul has been restarted." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:06:29] (03CR) 10Zabe: [C:04-2] Using Hadoop for MostTranscludedPages on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [16:07:06] (03CR) 10Kamila Součková: haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [16:07:40] (03Merged) 10jenkins-bot: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:08:45] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11273103 (10ssingh) a:03BCornwall [16:09:56] (03PS1) 10Andrew Bogott: wmcs-annual-purge: more 2025 updates [puppet] - 10https://gerrit.wikimedia.org/r/1196117 
[16:10:14] (03PS2) 10Andrew Bogott: wmcs-annual-purge.py: update to catch up with various openstack changes [puppet] - 10https://gerrit.wikimedia.org/r/1196114 [16:10:17] (03Abandoned) 10Andrew Bogott: wmcs-annual-purge: more 2025 updates [puppet] - 10https://gerrit.wikimedia.org/r/1196117 (owner: 10Andrew Bogott) [16:11:01] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: reboot [16:12:44] !log rebooting phab2002 [16:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:38] (03PS1) 10Cathal Mooney: Add inter.link transit on cr1-drmrs and set up community for anti-ddos [homer/public] - 10https://gerrit.wikimedia.org/r/1196118 (https://phabricator.wikimedia.org/T401104) [16:14:59] (03PS1) 10BCornwall: Add wmf-debci trixie image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 [16:15:14] (03CR) 10Ottomata: [C:03+1] [mw-enrichment] Bump to v1.42.0 and Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195223 (https://phabricator.wikimedia.org/T401725) (owner: 10TChin) [16:15:16] (03PS3) 10Btullis: Enable notifications for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195775 (https://phabricator.wikimedia.org/T402943) [16:15:16] (03PS3) 10Btullis: Enable canary events on an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) [16:15:16] (03PS3) 10Btullis: Migrate data_check refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [16:15:17] (03PS3) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [16:15:18] (03PS3) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 
(https://phabricator.wikimedia.org/T402943) [16:15:21] (03PS3) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [16:15:25] (03PS3) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [16:15:29] (03PS3) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [16:15:33] (03PS3) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [16:15:37] (03PS3) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [16:16:06] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:17:01] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:17:49] (03CR) 10Cathal Mooney: [C:03+2] Add inter.link transit on cr1-drmrs and set up community for anti-ddos [homer/public] - 10https://gerrit.wikimedia.org/r/1196118 (https://phabricator.wikimedia.org/T401104) (owner: 10Cathal Mooney) [16:18:04] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1003.eqiad.wmnet with OS trixie [16:18:04] !log fceratto@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1003.eqiad.wmnet [16:18:34] (03CR) 10BCornwall: [C:03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1195828 (https://phabricator.wikimedia.org/T407177) (owner: 10Gerrit maintenance bot) [16:18:50] (03PS4) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) 
[16:18:50] (03PS4) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [16:18:50] (03PS4) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [16:18:50] (03PS4) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [16:19:10] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on releases1003.eqiad.wmnet with reason: reboot [16:19:41] !log rebooting backend of releases.wikimedia.org [16:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:41] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:21:17] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:22:04] (03Merged) 10jenkins-bot: Add inter.link transit on cr1-drmrs and set up community for anti-ddos [homer/public] - 10https://gerrit.wikimedia.org/r/1196118 (https://phabricator.wikimedia.org/T401104) (owner: 10Cathal Mooney) [16:24:22] (03CR) 10Btullis: [C:03+2] Enable notifications for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195775 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:26:04] (03CR) 10BCornwall: [C:03+1] lvs1018: remove L2 sub-interface config for row E/F vlans [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [16:27:25] (03CR) 10BCornwall: [C:03+1] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1196078 (owner: 10Muehlenhoff) [16:27:31] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-codfw [16:27:56] !log swfrench@deploy2002 helmfile [codfw] START 
helmfile.d/services/api-gateway: apply [16:28:42] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [16:32:28] (03CR) 10Btullis: [C:03+1] airflow-main: reduce alotted CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196111 (https://phabricator.wikimedia.org/T407191) (owner: 10Brouberol) [16:32:30] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:32:40] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:33:11] (03CR) 10Scott French: [C:03+2] rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:34:22] RESOLVED: [8x] ProbeDown: Service restbase2037-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:59] (03Merged) 10jenkins-bot: rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:35:23] (03PS4) 10Btullis: Enable canary events on an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) [16:35:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11273272 (10Dzahn) I think if the problem statement includes "I don't have any special knowledge of what the correct values should be" th... 
[16:35:34] (03PS4) 10Btullis: Migrate data_check refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [16:35:42] (03PS4) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [16:35:43] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11273274 (10cmooney) >>! In T396065#11265651, @VRiley-WMF wrote: > Hey @cmooney Just checked it, and I apologize. It wasn't plugged in yet, however, that's been corrected. Yep that is looking... [16:35:51] (03PS4) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [16:36:03] (03PS4) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [16:36:10] (03PS5) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [16:36:18] (03PS5) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [16:36:27] (03PS5) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [16:36:28] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:36:34] (03PS5) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [16:36:37] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:37:03] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:38:02] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:38:33] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:38:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11273288 (10BCornwall) @cmooney That looks good to me. As to whether /etc/network/interfaces will be reconfigured... I believe it will *not* reconfig... [16:39:52] (03CR) 10MVernon: "Hi," [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 (owner: 10BCornwall) [16:40:33] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:41:20] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [16:41:35] (03PS3) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187491 [16:43:25] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:43:35] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:43:57] (03PS2) 10BCornwall: Add wmf-debci trixie image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 [16:44:07] (03CR) 10BCornwall: Add wmf-debci trixie image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 (owner: 10BCornwall) [16:46:59] (03PS8) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [16:46:59] (03PS8) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [16:46:59] (03PS7) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [16:50:29] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [16:51:02] (03PS1) 10Ebernhardson: Revert "cirrus: Start AB test of did-you-mean profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) [16:52:24] (03CR) 10Dzahn: "the patch itself looks good to me. 
needs manager approval on ticket though" [puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) (owner: 10JavierMonton) [16:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:53:10] FIRING: BFDdown: BFD session down between cr2-eqord and fe80::8618:88ff:fe0d:d944 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:53:36] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11273394 (10Dzahn) @JMonton-WMF Your patch looks good to me. This will need approval from 2 people on this ticket though. Your mana... [16:54:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11273407 (10Dzahn) @Ahoelzl Your approval is needed for this access request. Please take a look. [16:55:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11273410 (10cmooney) >>! In T405499#11273288, @BCornwall wrote: > @cmooney That looks good to me. As to whether /etc/network/interfaces will be recon... 
[16:55:11] (03PS3) 10Andrew Bogott: wmcs-annual-purge.py: update to catch up with various openstack changes [puppet] - 10https://gerrit.wikimedia.org/r/1196114 [16:55:50] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:55:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:56:32] (03CR) 10Andrew Bogott: [C:03+2] wmcs-annual-purge.py: update to catch up with various openstack changes [puppet] - 10https://gerrit.wikimedia.org/r/1196114 (owner: 10Andrew Bogott) [16:57:09] (03PS9) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [16:57:09] (03PS9) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [16:57:09] (03PS8) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [16:58:10] RESOLVED: BFDdown: BFD session down between cr2-eqord and fe80::8618:88ff:fe0d:d944 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:58:50] 06SRE, 06Infrastructure-Foundations, 10netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107#11273438 (10cmooney) 05Open→03Resolved Closing this, Rob opened T407263 for the replacement optic. 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T1700) [17:00:18] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [17:01:01] (03CR) 10Mforns: Migrate the refine_netflow job to an-launcher1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:01:37] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:02:10] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:02:41] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:02:49] FYI, I'll likely be deploying some changes as part of the infra window. 
[17:03:02] just monitoring some earlier ones before proceeding [17:03:24] (03CR) 10Cathal Mooney: lvs1018: remove L2 sub-interface config for row E/F vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [17:08:37] (03PS1) 10Cathal Mooney: cr2-eqiad: add EBGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1196131 (https://phabricator.wikimedia.org/T396065) [17:09:46] (03PS10) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [17:09:47] (03PS10) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [17:09:47] (03PS9) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [17:09:52] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1194716 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:09:53] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): Scale next releases to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194716 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:10:17] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [17:12:17] (03Merged) 10jenkins-bot: mw-(api-ext|web): Scale next releases to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194716 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:14:08] (03PS11) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [17:14:08] (03PS11) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [17:14:08] (03PS10) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [17:14:35] (03CR) 10CI reject: [V:04-1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [17:15:07] !log swfrench@deploy2002 Started scap sync-world: Non-image-build scap run to scale 8.3 deployments - T405955 [17:15:13] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:16:22] (03PS12) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [17:16:22] 
(03PS12) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [17:16:22] (03PS11) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [17:19:35] !log swfrench@deploy2002 Finished scap sync-world: Non-image-build scap run to scale 8.3 deployments - T405955 (duration: 05m 41s) [17:19:48] swfrench-wmf: if you end up having any unused time at the end of your window, lmk and I'll charlie some stuff, but no worries if you don't :) [17:20:01] (03CR) 10Btullis: Migrate the refine_netflow job to an-launcher1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:21:27] (03PS1) 10Andrew Bogott: wmcs-annual-purge.py: add os deprecation callouts [puppet] - 10https://gerrit.wikimedia.org/r/1196132 [17:22:12] (03CR) 10CI reject: [V:04-1] wmcs-annual-purge.py: add os deprecation callouts [puppet] - 10https://gerrit.wikimedia.org/r/1196132 (owner: 10Andrew Bogott) [17:23:13] rzl: I _might_ run a scap backport in the remaining time, but I think it's probably fine if you get started in the interim (i.e., low likelihood of conflict) [17:24:03] * swfrench-wmf just realized the train rolled to group0 earlier today [17:24:04] cool :) nobody's after you so I'm also happy to wait, whatever's more comfortable [17:24:20] we've got *loads* of time, heh [17:24:39] (03CR) 10Cathal Mooney: [C:03+2] cr2-eqiad: add EBGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1196131 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [17:25:05] (03CR) 10Brouberol: [C:03+2] airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 
10Brouberol) [17:25:22] (03CR) 10Brouberol: [C:03+2] airflow-main: reduce alotted CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196111 (https://phabricator.wikimedia.org/T407191) (owner: 10Brouberol) [17:26:23] (03Merged) 10jenkins-bot: cr2-eqiad: add EBGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1196131 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [17:27:42] (03Merged) 10jenkins-bot: airflow: allow the deployment of the triggerer component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196028 (https://phabricator.wikimedia.org/T406958) (owner: 10Brouberol) [17:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:30:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Create boot environment of Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11273708 (10ssingh) Traffic discussed this in the team meeting today. We decided that given the above blocker, we should simply move to trixie and use OpenSSL (3.5.0) a... [17:32:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11273718 (10ssingh) Cross-posting the comment from T405102#11273708, > Traffic discussed this in the team meeting today. We decided that given the above blocke... 
[17:34:46] (03CR) 10TChin: [C:03+2] [mw-enrichment] Bump to v1.42.0 and Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195223 (https://phabricator.wikimedia.org/T401725) (owner: 10TChin) [17:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:36:43] (03Merged) 10jenkins-bot: [mw-enrichment] Bump to v1.42.0 and Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195223 (https://phabricator.wikimedia.org/T401725) (owner: 10TChin) [17:37:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11273750 (10RobH) >>! In T405647#11250698, @RobH wrote: > @klausman, > > Can you provide feedback on when we can migrate these hosts from one network port to th... [17:38:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Create boot environment of Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11273754 (10ssingh) 05Open→03Resolved a:03ssingh Thanks @MoritzMuehlenhoff for working on this and researching it. I am closing this for the reason mentioned... [17:39:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11273760 (10BCornwall) I'm fine with manually editing the interfaces but ensuring that the next time there's an install run it'll properly enumerate. 
[17:40:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11273763 (10ssingh) FWIW we have typically reimaged for this in the past. I am not suggesting, just sharing! And given that this is lvs1020, that mig... [17:40:59] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:41:10] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:41:19] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11273764 (10RobH) @dzahn, We've pushed the start of this migration back to the first of November. Can we get updated date/time preference from you for t... [17:45:53] rzl: apologies for the delay - just needed to get a couple of dashboards together. I'll get started with that backport shortly. [17:48:36] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [17:48:42] swfrench-wmf: no worries! [17:48:51] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [17:49:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11273802 (10RobH) Updating https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc0/edit?usp=sharing with the details of the mov... 
[17:50:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194718 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:51:35] (03Merged) 10jenkins-bot: Enroll 1% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194718 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:52:08] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1194718|Enroll 1% of client sessions in PHP 8.3 (T405955)]] [17:52:12] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:52:55] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11273811 (10Dzahn) @RobH Sure! I suggested Monday, Nov 3 and updated in Google calendar to see if this works for everyone. [17:56:18] (03CR) 10Majavah: Remove a deprecation warning for datetime in _menu.py (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [17:56:37] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1194718|Enroll 1% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:01:39] !log swfrench@deploy2002 swfrench: Continuing with sync [18:02:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11273854 (10RobH) >>! In T405945#11250904, @LSobanski wrote: > @RobH here's a summary of what needs to happen with the hosts, @cmooney will be coo... 
[18:03:06] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS trixie [18:03:19] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS trixie [18:03:34] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum4002.ulsfo.wmnet with OS trixie [18:03:51] (03PS1) 10Majavah: remote: Support timezone-aware objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) [18:06:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11273867 (10RobH) Sent a followup via email to Cole and Keith today: > Keith / Cole, > > I assigned https://phabricator.wikimedia.org/T405946 over to you both for feedb... [18:08:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:09:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11273871 (10RobH) @bking and @gehel, We've not heard anything back on this since my IRC chat with @gehel on Sept 29th.... [18:10:16] 06SRE, 10DNS, 10Domains, 06Traffic: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11273873 (10BCornwall) https://sites.google.com/wikimedia.org/wp25-foundation-celebration/ recently launched per a org-wide email. I contacted them ab... 
[18:10:51] (03PS1) 10BCornwall: mediawiki/httpbb: Add 25.wikipedia.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) [18:11:14] (03PS1) 10BCornwall: Add 25.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1196142 (https://phabricator.wikimedia.org/T407156) [18:11:26] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194718|Enroll 1% of client sessions in PHP 8.3 (T405955)]] (duration: 19m 18s) [18:11:29] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:11:51] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:11:58] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:12:25] rzl: all yours! thanks for your patience [18:12:59] swfrench-wmf: ack thanks! [18:13:10] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:13:30] (03PS2) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1192647 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [18:14:39] (03CR) 10Ssingh: [C:03+1] Add 25.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1196142 (https://phabricator.wikimedia.org/T407156) (owner: 10BCornwall) [18:15:38] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1192647 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [18:15:54] 06SRE, 10DNS, 10Domains, 06Traffic, 13Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11273889 (10Dzahn) >>! 
In T407156#11273872, @BCornwall wrote: > https://sites.google.com/wikimedia.org/wp25-foundation-celebrati... [18:16:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11273891 (10RobH) @Clement_Goubert, Just checking in as there hasn't been any update to the google sheet for the #serviceops hosts yet. I've added the notes for wikikube-ctrl1... [18:17:35] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [18:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:18:10] !log brett@dns1004 START - running authdns-update [18:19:23] !log brett@dns1004 END - running authdns-update [18:19:42] 06SRE, 10DNS, 10Domains, 06Traffic, 13Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11273895 (10Dzahn) @SCampos-WMF Do we really plan to support a site on Google sites forever? Did we move away from VIP Wordpress... 
[18:22:29] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:22:34] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [18:22:37] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [18:22:47] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:23:19] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:23:22] 06SRE, 10DNS, 10Domains, 06Traffic, 13Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11273902 (10BCornwall) The google sites page is secondary to this ticket - it leaves many questions, for sure... but I'm operati... [18:23:23] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [18:26:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [18:27:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11273917 (10herron) Added details to the spreadsheet thanks! 
[18:28:42] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/apertium: apply [18:28:49] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [18:29:15] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [18:29:33] (03CR) 10BCornwall: [C:03+2] Add 25.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1196142 (https://phabricator.wikimedia.org/T407156) (owner: 10BCornwall) [18:29:54] !log brett@dns1004 START - running authdns-update [18:31:05] !log brett@dns1004 END - running authdns-update [18:31:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [18:31:38] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [18:31:46] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:31:48] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [18:32:04] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [18:32:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [18:32:33] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:32:43] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:34:43] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [18:34:46] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [18:35:13] !log rzl@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/chart-renderer: apply [18:36:10] that HelmReleaseBadStatus alert is interesting, looking [18:36:41] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:36:56] mw-script/amfcta11 [18:37:07] mw-script/amfcta11 is a job that completed last Wednesday, weird [18:37:21] completed successfully, even [18:38:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS trixie [18:40:35] and doesn't appear at all in `helm list` [18:40:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [18:41:31] but I guess just because Helm thinks it's not deployed, it does show as `pending-install` in `helm history amfcta11` [18:42:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:43:10] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:43:23] since the job is long since completed I think I'm just going to `helm uninstall` it to clean up the alert [18:44:22] RESOLVED: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:44:35] 
!log rzl@deploy1003:~$ kube-env mw-script-deploy codfw; helm uninstall amfcta11 # HelmReleaseBadStatus alert was firing for this mw-script job in state pending-install, even though the job was long since finished [18:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS trixie [18:46:38] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS trixie [18:46:53] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:46:56] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:48:08] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [18:48:33] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [18:49:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [18:49:21] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [18:51:14] (03PS2) 10Andrew Bogott: wmcs-annual-purge.py: add os deprecation callouts [puppet] - 10https://gerrit.wikimedia.org/r/1196132 [18:51:29] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [18:52:00] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [18:52:43] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [18:52:59] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [18:53:10] RESOLVED: [12x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:53:23] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [18:53:38] !log 
rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [18:54:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: apply [18:55:01] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [18:55:16] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [18:55:29] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [18:55:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [18:55:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [18:57:12] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:57:35] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:59:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:59:41] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:59:48] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5002.eqsin.wmnet with OS trixie [19:00:04] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [19:00:17] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [19:01:13] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:01:32] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:02:43] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [19:03:10] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [19:03:33] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [19:03:57] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [19:04:39] !log 
rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [19:04:52] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [19:05:23] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [19:05:40] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [19:06:07] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [19:06:32] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [19:08:07] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [19:08:39] (03PS33) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [19:08:40] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:08:50] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [19:09:24] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [19:12:25] (03PS1) 10CDanis: varnish: WMF-Uniq -> Analytics: fix frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1196154 (https://phabricator.wikimedia.org/T405783) [19:15:04] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [19:19:38] jouncebot: nowandnext [19:19:38] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [19:19:38] In 0 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T2000) [19:25:06] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [19:28:42] !log 
rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mathoid: apply [19:29:02] 06SRE, 10DNS, 10Domains, 06Traffic, 13Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11274194 (10BCornwall) @SCampos-WMF The changes are ready to go - when can we expect a functional page to be posted? Would you p... [19:29:11] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [19:29:17] Reedy: I'm still working down the list with these envoy upgrades but happy to pause if you need :) [19:29:31] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [19:29:44] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [19:29:51] rzl: Nothing urgent from me :) [19:29:56] 👍 [19:30:23] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:32:49] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:34:42] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [19:34:55] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [19:35:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [19:35:33] (03PS1) 10DLynch: Suggestions mode [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196155 (https://phabricator.wikimedia.org/T399612) [19:35:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196155 (https://phabricator.wikimedia.org/T399612) (owner: 10DLynch) [19:35:53] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [19:36:05] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/push-notifications: apply 
[19:36:30] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [19:38:15] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [19:38:39] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [19:38:54] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [19:39:10] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [19:39:30] (03CR) 10Andrew Bogott: [C:03+2] wmcs-annual-purge.py: add os deprecation callouts [puppet] - 10https://gerrit.wikimedia.org/r/1196132 (owner: 10Andrew Bogott) [19:39:49] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [19:40:26] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [19:40:53] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [19:41:16] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [19:49:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [19:49:35] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [19:50:05] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [19:50:20] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [19:50:48] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [19:51:11] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [19:51:25] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [19:51:51] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [19:53:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [19:53:53] !log rzl@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/tegola-vector-tiles: apply [19:54:16] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [19:54:45] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [19:55:08] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply [19:55:17] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [19:55:41] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [19:56:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [19:56:23] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [19:56:37] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [19:56:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [19:57:10] done! [19:58:13] (03PS1) 10Kimberly Sarabia: Set reader experiment to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196166 (https://phabricator.wikimedia.org/T406916) [19:59:03] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [19:59:36] (03CR) 10Eric Gardner: [C:03+1] Set reader experiment to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196166 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T2000) [20:00:05] seanleong-wmde, toyofuku, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] o/ [20:00:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196166 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [20:01:06] I can deploy myself. [20:01:54] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11274296 (10VRiley-WMF) They have responded back looking for site information for them to send a replacment. Will await for part to come onsite. [20:02:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196155 (https://phabricator.wikimedia.org/T399612) (owner: 10DLynch) [20:02:27] hello sorry I'm late [20:02:48] hello [20:02:57] hiiiiii [20:03:01] hi [20:03:04] Kemayo: have you started already? [20:03:17] toyofuku: Yes, mine's merging at the moment. [20:03:27] Sounds good [20:04:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11274303 (10VRiley-WMF) 05Open→03Resolved [20:04:30] (03Merged) 10jenkins-bot: Suggestions mode [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196155 (https://phabricator.wikimedia.org/T399612) (owner: 10DLynch) [20:05:01] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1196155|Suggestions mode (T399612)]] [20:05:06] T399612: Create VE Suggestion Mode MVP - https://phabricator.wikimedia.org/T399612 [20:08:59] marostegui@cumin1003 clone_es (PID 2658664) is awaiting input [20:09:21] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1196155|Suggestions mode (T399612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:57] !log kemayo@deploy2002 kemayo: Continuing with sync [20:12:54] (03PS1) 10Kimberly Sarabia: ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196170 (https://phabricator.wikimedia.org/T405992) [20:14:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196170 (https://phabricator.wikimedia.org/T405992) (owner: 10Kimberly Sarabia) [20:17:01] (03CR) 10Dzahn: "on a ticket I saw a comment that they dont want to use these domains after all" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:17:44] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [20:17:48] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196155|Suggestions mode (T399612)]] (duration: 12m 47s) [20:17:52] T399612: Create VE Suggestion Mode MVP - https://phabricator.wikimedia.org/T399612 [20:18:09] I think about 90% of this deployment was "waiting for the last 3-5% of two different k8s deploys". [20:18:21] Anyway, toyofuku, you're up. [20:18:35] seanleong-wmde: do you have a deployer/will you be deploying? 
[20:18:40] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:18:52] I was second not first [20:18:52] (03PS1) 10Bking: ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) [20:19:10] mine is just a patch revert [20:19:24] Ah, does it not need to be backported? [20:19:46] it does haha sry, backport to group0 [20:20:00] as the train is alrdy halfway out [20:20:10] Do you have someone who will be doing that deploy? [20:20:20] I'm trained but only for deploying my own team's patches [20:20:34] (03PS2) 10Bking: ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) [20:20:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 4.727 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:20:47] i don't have it now [20:20:54] Gotcha [20:20:59] is it still possible to do it? [20:21:06] Would you mind if Kim and I go ahead with our deploys then while you find someone? [20:21:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:21:12] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5002.eqsin.wmnet with OS trixie [20:21:32] Unless someone in the channel is available to do that deploy for you [20:21:53] yea sure, but I don't have anyone now as it's pass working hours [20:22:30] oh no [20:22:33] Okay I'm starting mine [20:22:41] I'll try to find [20:22:43] How urgent is it that your patch get reverted right now? 
[20:23:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [20:23:57] I'm around if need help [20:24:01] <3 [20:24:41] It'll be to revert before the next train if anyone could help, if not that's fine as well, I can schedule another one tmr morning [20:24:45] Kicking off mine in the meantime to get through this list as fast as we (safely) can [20:24:46] Thanks Amir1! [20:24:50] (03Merged) 10jenkins-bot: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [20:25:21] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1193445|Add ReadingList Stream to EventStreamConfig (T406627)]] [20:25:25] T406627: Reading Lists Instrumentation - Contextual Attributes - Create Stream [Deploy and QA ] - https://phabricator.wikimedia.org/T406627 [20:25:26] and thanks toyofuku too! [20:25:44] ofc!! Mine's a config deploy so hopefully quick 🤞 [20:26:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:26:14] Enough time for you to stream Golden from Kpop Demon Hunters on spotify 1-2 times [20:28:03] doing it atm, it's gonna be, gonna be golden! [20:28:07] xD [20:28:13] lol [20:28:28] 💛💛💛 [20:28:44] (ping me if you need me) [20:28:58] I also need help deploying mine but if it's past the time, i can reschedule for tomorrow. Mine was added last minute [20:29:37] !log toyofuku@deploy2002 lmora, toyofuku: Backport for [[gerrit:1193445|Add ReadingList Stream to EventStreamConfig (T406627)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:30:00] Testing quickly [20:30:34] !log toyofuku@deploy2002 lmora, toyofuku: Continuing with sync [20:31:36] Eventlogging code is commented out so no way to test the stream - I reloaded wikipedia and made sure nothing obvious was wrong [20:31:51] Looking at events on testservers while the code syncs [20:33:13] Still seeing events so hopefully we're good [20:36:05] toyofuku: if there is nothing afterwards. Feel free to take your time. Don't rush [20:36:12] jouncebot: nowandnext [20:36:12] For the next 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T2000) [20:36:12] In 0 hour(s) and 23 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T2100) [20:36:42] Lol Kim and I are former web so I think between the two of us we can say that time is not needed [20:37:06] It's late in Germany though!! I don't want seanleong-wmde being kept past what we need to [20:37:13] (03CR) 10Ssingh: "@dzahn@wikimedia.org: where are you seeing that? Per the ticket, "Update: It looks like the following domains wiki25.com, wiki25.org, wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:37:16] Who knows how tagging in IRC works [20:37:19] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193445|Add ReadingList Stream to EventStreamConfig (T406627)]] (duration: 11m 58s) [20:37:23] T406627: Reading Lists Instrumentation - Contextual Attributes - Create Stream [Deploy and QA ] - https://phabricator.wikimedia.org/T406627 [20:37:32] Total time 14 minutes [20:37:45] 5 more streams to golden [20:37:59] no worries, take your time :D [20:38:03] Amir1: all you [20:41:27] Hi Amir1, would you be able to help out with the deployment? [20:41:34] sure [20:41:38] one second [20:41:43] thanks! 
no worries take ur time [20:41:53] this https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1196101? [20:42:03] yes [20:42:08] (03CR) 10Ladsgroup: [C:03+2] Revert "Add icons for wikibase changes. WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196101 (owner: 10Neslihan Turan) [20:43:44] (03PS34) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [20:48:04] (03CR) 10Dzahn: "first part of https://phabricator.wikimedia.org/T407156#11273872 but second part is about creating redirects.. so I guess that _is_ using" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:48:45] hi [20:48:55] ignore sry, mistyped [20:50:21] (03CR) 10BCornwall: "To be clear, that comment was that the internal google sites domain was not involved with these domains. The redirects are still wanted." [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:50:38] (03CR) 10BCornwall: "Resolved on another thread." [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:51:41] (03CR) 10Dzahn: "that being said, I would normally not recommend to add multiple domains to point to identical content. 
It is usually bad for SEO, rather t" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:52:52] (03CR) 10Ssingh: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:53:38] (03CR) 10BCornwall: "I agree, and would argue that these should have not have been registered in the first place. But that's beyond my grasp 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:54:28] (03PS1) 10Scott French: mw-(api-ext|web): Increase maxSurge on 8.3 "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196176 (https://phabricator.wikimedia.org/T405955) [20:54:30] (03CR) 10Ssingh: [C:03+1] "That's a TIL for me for sure but also we have done the same for many of these redirects, plus I do wonder how much we should care about th" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:56:13] (03CR) 10BCornwall: "Sir, the trigger-happiness of non-technical teams to register domains are above my station, sir!" [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [20:57:53] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): Increase maxSurge on 8.3 "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196176 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:58:31] (03Merged) 10jenkins-bot: Revert "Add icons for wikibase changes. 
WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196101 (owner: 10Neslihan Turan) [20:59:21] Saw it, thanks Amir1 [20:59:27] <3 [20:59:37] it's being deployed [20:59:44] soon will be in mwdebug [20:59:48] do you know what mwdebug is? [20:59:52] yupp [20:59:56] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1196101|Revert "Add icons for wikibase changes. WIP"]] [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251014T2100) [21:00:08] Amir1: FYI, I'm going to merge a deployment-charts change in the background that should ideally speed up the deployment when you get to the prod-k8s step. no action needed on your part. [21:00:20] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): Increase maxSurge on 8.3 "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196176 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [21:00:23] awesome [21:00:31] thanks swfrench-wmf for being amazing <3 [21:01:16] marostegui@cumin1003 clone_es (PID 2653801) is awaiting input [21:01:27] don't thank me until we see how well it works :) (tl;dr - the additional 3-4 minutes of slowness is a temporary artifact of the 8.3 migration) [21:02:23] (03Merged) 10jenkins-bot: mw-(api-ext|web): Increase maxSurge on 8.3 "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196176 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [21:03:17] (03PS35) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:04:33] !log ladsgroup@deploy2002 neslihanturan, ladsgroup: Backport for [[gerrit:1196101|Revert "Add icons for wikibase changes. WIP"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:04:44] Amir1: can you let me know when you are done with the deployment window? I have a small beta cluster change I'd like to land [21:05:14] alright, https://gerrit.wikimedia.org/r/1196176 is live, so it should get picked up passively when you continue the sync onward to prod [21:05:15] Jdlrobson: beta cluster changes can go at any time, just need a rebase on deployment host which I can do [21:05:24] seanleong-wmde: it's live in mwdebug, please test [21:05:30] on it now [21:05:54] what change do you have Jon? I can just land it (it'll take ten minutes to show up in beta cluster) [21:05:56] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1187491?usp=search < Amir1 [21:06:00] thanks! [21:06:53] (03CR) 10Ladsgroup: [C:03+2] Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187491 (owner: 10Jdlrobson) [21:07:45] (03Merged) 10jenkins-bot: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187491 (owner: 10Jdlrobson) [21:08:28] Jdlrobson: rebased. You'll have it in 10 minutes ish [21:10:20] looks good [21:10:24] !log ladsgroup@deploy2002 neslihanturan, ladsgroup: Continuing with sync [21:12:03] (03CR) 10Ladsgroup: [C:03+2] ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196170 (https://phabricator.wikimedia.org/T405992) (owner: 10Kimberly Sarabia) [21:12:56] kimberly_sarabia: do you want the backport to go first (as requirement for enabling it) or they are independent? [21:13:03] (03Merged) 10jenkins-bot: ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196170 (https://phabricator.wikimedia.org/T405992) (owner: 10Kimberly Sarabia) [21:13:41] thx Amir1 [21:13:43] Amir1: yes they're related. 
the backport can go first [21:14:33] okay, it's gonna take a while for gate to finish though :D [21:14:51] well, I was wrong [21:14:55] that was blazing fast [21:16:30] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196101|Revert "Add icons for wikibase changes. WIP"]] (duration: 16m 34s) [21:16:30] (03CR) 10Ssingh: [C:03+1] "my +1 to merge at least." [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [21:16:36] thanks Amir1 [21:16:44] \o/ [21:17:04] Thank you for flying with us. Hope to see you next time! [21:17:09] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1196170|ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC (T405992)]] [21:17:12] T405992: Image Browsing: UI Bug Bash - https://phabricator.wikimedia.org/T405992 [21:18:44] Literally a picture of me deploying [21:18:47] https://usercontent.irccloud-cdn.com/file/of5gr0z8/grafik.png [21:18:53] haha [21:18:55] (https://phabricator.wikimedia.org/project/view/1449/) [21:19:32] !log ladsgroup@deploy2002 ksarabia, ladsgroup: Backport for [[gerrit:1196170|ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC (T405992)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:19:39] kimberly_sarabia: it's in mwdebug [21:20:05] taking a look! 
[21:20:28] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1192649 (owner: 10Ncmonitor) [21:20:52] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1192648 (https://phabricator.wikimedia.org/T407156) (owner: 10Ncmonitor) [21:21:47] (03PS1) 10TChin: [mw-enrichment] Bump page change schema to 1.3.0 to pick up user_central_id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196179 (https://phabricator.wikimedia.org/T401725) [21:22:33] hahaha it was a great flight o7 [21:25:27] Amir1: LGTM. [21:25:30] !log ladsgroup@deploy2002 ksarabia, ladsgroup: Continuing with sync [21:25:45] scapping then! [21:26:34] (03CR) 10Ladsgroup: [C:03+2] Set reader experiment to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196166 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [21:27:23] (03Merged) 10jenkins-bot: Set reader experiment to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196166 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [21:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:31] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196170|ImageBrowsing: fix UI bugs in Overlay, DetailView and VTOC (T405992)]] (duration: 14m 22s) [21:31:35] T405992: Image Browsing: UI Bug Bash - https://phabricator.wikimedia.org/T405992 [21:32:12] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1196166|Set reader experiment to true (T406916)]] [21:32:16] T406916: Reader Experiments: Deploy extension to production Arabic, Vietnamese, French, Chinese, Indonesian Wikipedia - 
https://phabricator.wikimedia.org/T406916 [21:32:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11274549 (10RobH) @bking pointed out a valid observcation, C4 and C7 racks leverage 10G switches. However, we are repla... [21:34:36] !log ladsgroup@deploy2002 ksarabia, ladsgroup: Backport for [[gerrit:1196166|Set reader experiment to true (T406916)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:34:55] kimberly_sarabia: live in mwdebug [21:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:36:19] taking a look [21:38:41] Amir1: LGTM [21:38:44] !log ladsgroup@deploy2002 ksarabia, ladsgroup: Continuing with sync [21:38:51] weeee [21:41:08] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11274589 (10Ladsgroup) [21:41:55] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11274591 (10Ladsgroup) This is for @thcipriani and @Ahoelzl to approve [21:42:57] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:43:38] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196166|Set reader experiment to true (T406916)]] (duration: 11m 26s) [21:43:42] T406916: Reader Experiments: Deploy extension to production Arabic, Vietnamese, French, Chinese, Indonesian Wikipedia - https://phabricator.wikimedia.org/T406916 [21:44:45] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:45:40] logs look good, things are clean. I can go and dream 💤 [21:45:53] Amir1: Thanks so much [21:52:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11274619 (10bking) Thanks @RobH , here are the instructions: **cirrussearch** Depool hosts 2 minutes before the procedu... 
[21:53:59] (03CR) 10Aleksandar Mastilovic: Migrate the refine_netflow job to an-launcher1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:55:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11274620 (10bking) a:05bking→03RobH [21:55:18] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:56:03] (03CR) 10Aleksandar Mastilovic: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:56:35] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:59:07] (03CR) 10Scott French: [C:03+1] "Nice! This is quite readable." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [22:02:39] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [22:03:16] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [22:07:54] (03CR) 10Xcollazo: [C:03+1] [mw-enrichment] Bump page change schema to 1.3.0 to pick up user_central_id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196179 (https://phabricator.wikimedia.org/T401725) (owner: 10TChin) [22:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:20:41] (03Abandoned) 10Andrew Bogott: openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - 10https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030) (owner: 10Arturo Borrero Gonzalez) [22:25:48] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:26:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 1.293 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:40:58] (03PS1) 10Kimberly Sarabia: Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) [22:42:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:36] (03CR) 10Eric Gardner: [C:03+1] Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [22:55:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11274739 (10Jclark-ctr) @robh @vriley where we using this ticket as a placeholder for ordering replacement switch for c1? Unsure if this should be c... [23:06:29] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11274755 (10KFrancis) The NDA is complete. Thanks! [23:21:19] (03PS1) 10MusikAnimal: wish-index: pass in wishesData so that initial filters are set [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196194 (https://phabricator.wikimedia.org/T400945) [23:21:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196194 (https://phabricator.wikimedia.org/T400945) (owner: 10MusikAnimal) [23:31:17] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [23:32:00] (03Merged) 10jenkins-bot: wish-index: pass in wishesData so that initial filters are set [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196194 (https://phabricator.wikimedia.org/T400945) (owner: 10MusikAnimal) [23:32:30] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1196194|wish-index: pass in wishesData so that initial filters are set (T400945)]] [23:32:35] T400945: Add filters Vue app to 
wish index page - https://phabricator.wikimedia.org/T400945 [23:34:49] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1196194|wish-index: pass in wishesData so that initial filters are set (T400945)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:35:12] !log musikanimal@deploy2002 musikanimal: Continuing with sync [23:38:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196196 [23:38:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196196 (owner: 10TrainBranchBot) [23:39:38] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196194|wish-index: pass in wishesData so that initial filters are set (T400945)]] (duration: 07m 08s) [23:39:42] T400945: Add filters Vue app to wish index page - https://phabricator.wikimedia.org/T400945 [23:54:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196196 (owner: 10TrainBranchBot)