[00:03:04] FIRING: CertManagerCertNotReady: Certificate opensearch-test/opensearch-cluster-cluster-wmf-opensearch is not in a ready state (k8s-dse@eqiad) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=opensearch-test - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:08:04] RESOLVED: CertManagerCertNotReady: Certificate opensearch-test/opensearch-cluster-cluster-wmf-opensearch is not in a ready state (k8s-dse@eqiad) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=opensearch-test - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196542
[00:08:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196542 (owner: TrainBranchBot)
[00:14:13] jclark@cumin1002 reimage (PID 64990) is awaiting input
[00:18:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye
[00:19:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[00:28:51] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196542 (owner: TrainBranchBot)
[00:34:35] ops-eqiad, SRE, DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11279712 (Jclark-ctr) a:Jclark-ctr The server fails to boot normally. I was able to temporarily start it by booting from a single disk (sda–sdc), but no login attempts were successful. @eevans also confirmed that...
[00:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:37:08] PROBLEM - Host wikikube-worker2203 is DOWN: PING CRITICAL - Packet loss = 100%
[01:41:33] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:42:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:52:08] (CR) Finchgold: [C:+1] Add wgSitename for azwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) (owner: NMW03)
[02:55:08] (CR) Andrew Bogott: [C:+2] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - https://gerrit.wikimedia.org/r/1195769 (owner: Andrew Bogott)
[03:04:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2006-dev.codfw.wmnet with OS trixie
[03:22:51] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[03:29:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[04:16:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS trixie
[04:25:51] (PS1) Marostegui: mariadb: Productionize db1262 [puppet] - https://gerrit.wikimedia.org/r/1196551 (https://phabricator.wikimedia.org/T406550)
[04:26:24] (CR) Marostegui: [C:+2] mariadb: Productionize db1262 [puppet] - https://gerrit.wikimedia.org/r/1196551 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[04:30:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1260.eqiad.wmnet onto db1262.eqiad.wmnet
[04:30:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1260 - Depool db1260.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003
[04:30:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1260 - Depool db1260.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003
[04:32:36] (PS1) Marostegui: instances.yaml: Add es1054 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196552 (https://phabricator.wikimedia.org/T406488)
[04:33:07] (CR) Marostegui: [C:+2] instances.yaml: Add es1054 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196552 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[04:35:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1054 to dbctl depooled T406488', diff saved to https://phabricator.wikimedia.org/P83949 and previous config saved to /var/cache/conftool/dbconfig/20251016-043510-marostegui.json
[04:35:14] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[04:35:49] (PS1) Marostegui: es1054: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196554 (https://phabricator.wikimedia.org/T406488)
[04:36:34] (CR) Marostegui: [C:+2] es1054: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196554 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[04:38:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83950 and previous config saved to /var/cache/conftool/dbconfig/20251016-043816-root.json
[04:39:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2179 with weight 0 T407177', diff saved to https://phabricator.wikimedia.org/P83951 and previous config saved to /var/cache/conftool/dbconfig/20251016-043920-marostegui.json
[04:39:24] T407177: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T407177
[04:39:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s4 T407177
[04:40:16] (CR) Marostegui: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1195827 (https://phabricator.wikimedia.org/T407177) (owner: Gerrit maintenance bot)
[04:45:22] !log Starting s4 codfw failover from db2240 to db2179 - T407177
[04:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:25] T407177: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T407177
[04:45:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T407177', diff saved to https://phabricator.wikimedia.org/P83952 and previous config saved to /var/cache/conftool/dbconfig/20251016-044533-marostegui.json
[04:45:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2179 to s4 primary and set section read-write T407177', diff saved to https://phabricator.wikimedia.org/P83953 and previous config saved to /var/cache/conftool/dbconfig/20251016-044557-marostegui.json
[04:46:17] (CR) Marostegui: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1195828 (https://phabricator.wikimedia.org/T407177) (owner: Gerrit maintenance bot)
[04:46:21] !log marostegui@dns1006 START - running authdns-update
[04:46:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2240 T407177', diff saved to https://phabricator.wikimedia.org/P83954 and previous config saved to /var/cache/conftool/dbconfig/20251016-044650-marostegui.json
[04:47:36] !log marostegui@dns1006 END - running authdns-update
[04:49:52] (PS1) Marostegui: db2240: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196555 (https://phabricator.wikimedia.org/T406541)
[04:50:25] (CR) Marostegui: [C:+2] db2240: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196555 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[04:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:53:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2240.codfw.wmnet with reason: Maintenance
[04:53:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83955 and previous config saved to /var/cache/conftool/dbconfig/20251016-045323-root.json
[04:53:28] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[04:56:55] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:57:47] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.455 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:58:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm
[04:58:15] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11279859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm
[04:59:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2240 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83956 and previous config saved to /var/cache/conftool/dbconfig/20251016-045946-root.json
[05:01:53] (PS1) Marostegui: db2248: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196556 (https://phabricator.wikimedia.org/T406551)
[05:03:15] (CR) Marostegui: [C:+2] db2248: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196556 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui)
[05:05:25] (PS1) Marostegui: instances.yaml: Add db2248 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196557 (https://phabricator.wikimedia.org/T406551)
[05:07:02] (CR) Marostegui: [C:+2] instances.yaml: Add db2248 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196557 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui)
[05:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83957 and previous config saved to /var/cache/conftool/dbconfig/20251016-050829-root.json
[05:08:34] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:09:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2248 to dbctl depooled T406551', diff saved to https://phabricator.wikimedia.org/P83958 and previous config saved to /var/cache/conftool/dbconfig/20251016-050917-marostegui.json
[05:09:22] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
[05:10:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 1%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83959 and previous config saved to /var/cache/conftool/dbconfig/20251016-051015-root.json
[05:14:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2240 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83960 and previous config saved to /var/cache/conftool/dbconfig/20251016-051452-root.json
[05:16:52] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11279873 (Marostegui) Resolved→Open @Jhancock.wm I am trying to reimage this host so we can use for our database testing but I am running into issues with the installer. Unfort...
[05:23:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83961 and previous config saved to /var/cache/conftool/dbconfig/20251016-052335-root.json
[05:23:40] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:25:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 5%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83962 and previous config saved to /var/cache/conftool/dbconfig/20251016-052521-root.json
[05:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2240 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83963 and previous config saved to /var/cache/conftool/dbconfig/20251016-052958-root.json
[05:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83964 and previous config saved to /var/cache/conftool/dbconfig/20251016-053840-root.json
[05:38:45] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:40:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 7%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83965 and previous config saved to /var/cache/conftool/dbconfig/20251016-054027-root.json
[05:41:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:42:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2240 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83967 and previous config saved to /var/cache/conftool/dbconfig/20251016-054504-root.json
[05:50:32] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11279913 (Marostegui) @Jhancock.wm probably we also need Legacy BIOS by the way.
[05:50:55] marostegui@cumin1003 reimage (PID 3346169) is awaiting input
[05:51:25] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm
[05:51:36] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11279914 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003...
[05:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83968 and previous config saved to /var/cache/conftool/dbconfig/20251016-055346-root.json
[05:53:52] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:55:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 10%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83969 and previous config saved to /var/cache/conftool/dbconfig/20251016-055534-root.json
[05:59:53] (PS1) Marostegui: db1186: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196558 (https://phabricator.wikimedia.org/T407463)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T0600).
[06:00:43] (CR) Marostegui: [C:+2] db1186: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196558 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[06:02:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[06:03:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1186 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83970 and previous config saved to /var/cache/conftool/dbconfig/20251016-060300-marostegui.json
[06:03:05] (PS7) Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092)
[06:08:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83971 and previous config saved to /var/cache/conftool/dbconfig/20251016-060852-root.json
[06:08:53] (CR) Giuseppe Lavagetto: [C:-1] "The patch as it stands now does something we don't want: remove the concept of wikimedia_trust completely; instead, the task is about only" [puppet] - https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: BCornwall)
[06:08:57] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[06:10:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 20%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83972 and previous config saved to /var/cache/conftool/dbconfig/20251016-061040-root.json
[06:10:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83973 and previous config saved to /var/cache/conftool/dbconfig/20251016-061054-root.json
[06:12:54] fceratto@cumin1003 clone_es (PID 3171023) is awaiting input
[06:16:39] (PS1) Marostegui: db2145: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196559 (https://phabricator.wikimedia.org/T407463)
[06:17:25] (CR) Marostegui: [C:+2] db2145: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196559 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[06:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:18:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2145.codfw.wmnet with reason: Maintenance
[06:18:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2145 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83974 and previous config saved to /var/cache/conftool/dbconfig/20251016-061818-marostegui.json
[06:23:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83975 and previous config saved to /var/cache/conftool/dbconfig/20251016-062358-root.json
[06:24:03] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[06:25:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 25%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83976 and previous config saved to /var/cache/conftool/dbconfig/20251016-062546-root.json
[06:26:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83977 and previous config saved to /var/cache/conftool/dbconfig/20251016-062600-root.json
[06:26:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83978 and previous config saved to /var/cache/conftool/dbconfig/20251016-062618-root.json
[06:29:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196430 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff)
[06:39:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83979 and previous config saved to /var/cache/conftool/dbconfig/20251016-063904-root.json
[06:39:09] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[06:40:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 30%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83980 and previous config saved to /var/cache/conftool/dbconfig/20251016-064052-root.json
[06:41:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83981 and previous config saved to /var/cache/conftool/dbconfig/20251016-064106-root.json
[06:41:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83982 and previous config saved to /var/cache/conftool/dbconfig/20251016-064124-root.json
[06:44:03] (CR) Muehlenhoff: [C:+2] Add a Prometheus exporter to monitor the validity of the internal Ganeti CA [puppet] - https://gerrit.wikimedia.org/r/1196430 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff)
[06:54:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83983 and previous config saved to /var/cache/conftool/dbconfig/20251016-065410-root.json
[06:54:15] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[06:55:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 50%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83984 and previous config saved to /var/cache/conftool/dbconfig/20251016-065558-root.json
[06:56:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83985 and previous config saved to /var/cache/conftool/dbconfig/20251016-065612-root.json
[06:56:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83986 and previous config saved to /var/cache/conftool/dbconfig/20251016-065630-root.json
[07:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T0700).
[07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:11] hello
[07:01:15] I will start with my deployment
[07:09:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1054 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83987 and previous config saved to /var/cache/conftool/dbconfig/20251016-070916-root.json
[07:09:21] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:10:32] OK, I'm done
[07:11:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 60%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83988 and previous config saved to /var/cache/conftool/dbconfig/20251016-071104-root.json
[07:11:35] !log UTC morning deploys done
[07:11:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83989 and previous config saved to /var/cache/conftool/dbconfig/20251016-071136-root.json
[07:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:14] kostajh: that was fast, congrats! =)
[07:18:45] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning
[07:26:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 75%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83991 and previous config saved to /var/cache/conftool/dbconfig/20251016-072610-root.json
[07:27:43] (PS1) Marostegui: db2188: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196565 (https://phabricator.wikimedia.org/T407463)
[07:28:25] (CR) Marostegui: [C:+2] db2188: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196565 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[07:29:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2188.codfw.wmnet with reason: Maintenance
[07:29:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2188 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83992 and previous config saved to /var/cache/conftool/dbconfig/20251016-072932-marostegui.json
[07:33:36] (PS1) Marostegui: es1055: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196566 (https://phabricator.wikimedia.org/T406488)
[07:34:10] (CR) Marostegui: [C:+2] es1055: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196566 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[07:36:02] (PS1) Marostegui: instances.yaml: Add es1055 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196567 (https://phabricator.wikimedia.org/T406488)
[07:37:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2188 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83995 and previous config saved to /var/cache/conftool/dbconfig/20251016-073719-root.json
[07:37:27] (CR) Marostegui: [C:+2] instances.yaml: Add es1055 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1196567 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[07:41:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1055 to dbctl depooled T406488', diff saved to https://phabricator.wikimedia.org/P83996 and previous config saved to /var/cache/conftool/dbconfig/20251016-074118-marostegui.json
[07:41:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2248 (re)pooling @ 100%: Pooling 1P host in s4', diff saved to https://phabricator.wikimedia.org/P83997 and previous config saved to /var/cache/conftool/dbconfig/20251016-074122-root.json
[07:41:24] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:43:22] (PS5) Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066)
[07:43:36] (CR) Brouberol: [C:+2] data: add yubikey-generated ssh key to the brouberol user [puppet] - https://gerrit.wikimedia.org/r/1196473 (owner: Brouberol)
[07:46:19] (PS6) Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066)
[07:46:55] (CR) Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz)
[07:50:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83999 and previous config saved to /var/cache/conftool/dbconfig/20251016-075012-root.json
[07:50:17] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:52:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2188 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84000 and previous config saved to /var/cache/conftool/dbconfig/20251016-075225-root.json
[07:54:47] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 264936
[07:55:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 264936
[08:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T0800)
[08:03:43] hi, I am holding
[08:03:44] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11280174 (MatthewVernon)
[08:03:46] hi, I am holding due to https://phabricator.wikimedia.org/T407323
[08:03:52] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11280175 (MatthewVernon) Looks good now, thanks :)
[08:04:06] well it is probably NOT an issue, but we need a bit of time to investigate it and find out whether it is an actual blocker
[08:04:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning
[08:04:16] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2033.codfw.wmnet onto es2054.codfw.wmnet
[08:04:31] (PS1) Arnaudb: gerrit: tweak mod_qos ClientPrefer SrvMaxConnClose [puppet] - https://gerrit.wikimedia.org/r/1196568
[08:04:31] (CR) Arnaudb: "Unfortunately, @jgleeson@wikimedia.org had the error at a moment where Gerrit was unable to handle more connections than it was receiving " [puppet] - https://gerrit.wikimedia.org/r/1196568 (owner: Arnaudb)
[08:05:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84002 and previous config saved to /var/cache/conftool/dbconfig/20251016-080518-root.json
[08:05:22] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[08:05:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning
[08:07:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2188 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84004 and previous config saved to /var/cache/conftool/dbconfig/20251016-080731-root.json
[08:07:39] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11280189 (MatthewVernon) @elukey FWIW, feel free to wipe these disks (the host isn't in the swift rings ATM).
[08:07:59] (PS1) Marostegui: instances.yaml: Remove es1026 [puppet] - https://gerrit.wikimedia.org/r/1196569 (https://phabricator.wikimedia.org/T407351)
[08:08:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[08:08:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[08:08:34] (CR) Marostegui: [C:+2] instances.yaml: Remove es1026 [puppet] - https://gerrit.wikimedia.org/r/1196569 (https://phabricator.wikimedia.org/T407351) (owner: Marostegui)
[08:09:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1026 from dbctl T407351', diff saved to https://phabricator.wikimedia.org/P84005 and previous config saved to /var/cache/conftool/dbconfig/20251016-080948-marostegui.json
[08:09:53] T407351: decommission es1026.eqiad.wmnet - https://phabricator.wikimedia.org/T407351
[08:12:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[08:12:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[08:14:20] (CR) Cathal Mooney: [C:+1] "LGTM nice!" [puppet] - https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi)
[08:14:20] dealt, the issue has no user facing impact
[08:14:29] I'll push to all wikis
[08:14:30] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11280221 (MoritzMuehlenhoff) Looking at the logs at one of the new bookworm replicas in codfw (maps2012), the errors where we hit the connection limit are entirely gone. The rate o...
[08:15:28] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[08:17:45] (PS1) Marostegui: db1235: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196621 (https://phabricator.wikimedia.org/T407463)
[08:19:09] (CR) Marostegui: [C:+2] db1235: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196621 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[08:20:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84006 and previous config saved to /var/cache/conftool/dbconfig/20251016-082023-root.json
[08:20:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[08:20:28] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[08:20:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1235 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84007 and previous config saved to /var/cache/conftool/dbconfig/20251016-082031-marostegui.json
[08:20:59] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11280249 (elukey) +1 I agree with Moritz, just depooled eqiad :) @TheDJ we are
now using only codfw (new stack), so any performance issue should be visible. If you find any could y... [08:22:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2188 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84009 and previous config saved to /var/cache/conftool/dbconfig/20251016-082237-root.json [08:22:49] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196624 (https://phabricator.wikimedia.org/T405679) [08:22:51] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196624 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:24:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [08:24:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [08:25:29] (03PS1) 10Marostegui: es1026: Decommission host [puppet] - 10https://gerrit.wikimedia.org/r/1196625 (https://phabricator.wikimedia.org/T407351) [08:25:37] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196624 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:26:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1026.eqiad.wmnet [08:26:56] (03CR) 10Marostegui: [C:03+2] es1026: Decommission host [puppet] - 10https://gerrit.wikimedia.org/r/1196625 (https://phabricator.wikimedia.org/T407351) (owner: 10Marostegui) [08:27:48] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1026.eqiad.wmnet - https://phabricator.wikimedia.org/T407351#11280268 (10Marostegui) a:05Marostegui→03None [08:28:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1235 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84010 and previous config saved to 
/var/cache/conftool/dbconfig/20251016-082825-root.json [08:31:46] (03PS1) 10Cathal Mooney: ssw1-d1-eqiad: add to puppet inventory and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196626 (https://phabricator.wikimedia.org/T405558) [08:32:00] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.23 refs T405679 [08:32:00] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [08:32:04] T405679: 1.45.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T405679 [08:33:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: 10Giuseppe Lavagetto) [08:35:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84011 and previous config saved to /var/cache/conftool/dbconfig/20251016-083529-root.json [08:35:35] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [08:35:43] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:36:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:36:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:36:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1026.eqiad.wmnet [08:36:12] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1026.eqiad.wmnet - 
https://phabricator.wikimedia.org/T407351#11280300 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `es1026.eqiad.wmnet` - es1026.eqiad.wmnet (**PASS**) - D... [08:36:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1026.eqiad.wmnet - https://phabricator.wikimedia.org/T407351#11280302 (10Marostegui) Ready for #dc-ops [08:37:57] (03CR) 10Elukey: [C:03+1] ssw1-d1-eqiad: add to puppet inventory and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196626 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:41:02] (03CR) 10Arnaudb: [C:03+2] gerrit: ask the operator to merge puppet earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1196227 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:42:46] (03PS1) 10Marostegui: mariadb: Define mariadb packages for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) [08:43:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84013 and previous config saved to /var/cache/conftool/dbconfig/20251016-084331-root.json [08:44:19] (03CR) 10Marostegui: "@taavi@wikimedia.org I assume there are no buster hosts in cloud land?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [08:47:55] (03Merged) 10jenkins-bot: gerrit: ask the operator to merge puppet earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1196227 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:48:01] (03PS1) 10Arnaudb: gerrit: disable gerrit service to enable backups [puppet] - 10https://gerrit.wikimedia.org/r/1196629 (https://phabricator.wikimedia.org/T387833) [08:48:27] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Define mariadb packages for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [08:50:38] 06SRE, 10Hiddenparma, 13Patch-For-Review: Exclude logged in users from requestctl general filters, create separate scope for it. - https://phabricator.wikimedia.org/T407092#11280398 (10Joe) 05Open→03Resolved p:05Triage→03High [08:50:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84014 and previous config saved to /var/cache/conftool/dbconfig/20251016-085035-root.json [08:50:43] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [08:51:08] (03CR) 10Majavah: "correct!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [08:51:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning [08:51:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1260.eqiad.wmnet onto db1262.eqiad.wmnet [08:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:52:39] (03PS1) 10Cathal Mooney: Reverse snippets: add INCLUDEs for new loopback ranges eqiad [dns] - 10https://gerrit.wikimedia.org/r/1196630 (https://phabricator.wikimedia.org/T405558) [08:53:18] (03PS1) 10Joal: Update sqoop for mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/1196631 (https://phabricator.wikimedia.org/T406000) [08:53:55] (03CR) 10Marostegui: "@fceratto@wikimedia.org I've removed +2 from you cause I am not ready to merge yet, if it looks good to you go with +1 for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [08:54:05] (03CR) 10Marostegui: "Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [08:56:07] (03CR) 10Elukey: [C:03+1] Reverse snippets: add INCLUDEs for new loopback ranges eqiad [dns] - 10https://gerrit.wikimedia.org/r/1196630 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:56:42] (03CR) 10Cathal Mooney: [C:03+2] Reverse snippets: add INCLUDEs for new loopback ranges eqiad [dns] - 10https://gerrit.wikimedia.org/r/1196630 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:56:57] !log cmooney@dns2005 START - running authdns-update [08:57:59] !log cmooney@dns2005 END - running authdns-update [08:58:16] (03PS1) 10Federico Ceratto: es2054.yaml, instances.yaml: enable notifications, add es2054 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196632 (https://phabricator.wikimedia.org/T402859) [08:58:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84016 and previous config saved to /var/cache/conftool/dbconfig/20251016-085837-root.json [08:59:00] (03CR) 10Cathal Mooney: [C:03+2] ssw1-d1-eqiad: add to puppet inventory and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196626 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:59:55] (03PS1) 10Muehlenhoff: Enable the Prometheus exporter for the Ganeti CA on Ganeti masters [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) [09:00:11] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [09:00:33] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gerrit2002.wikimedia.org with reason: T407110 [09:02:08] (03PS1) 10Tiziano Fogli: nrpe::monitor_service: Propagate migration task to 
monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1196633 (https://phabricator.wikimedia.org/T395443) [09:02:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [09:02:56] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:34] (03CR) 10Marostegui: [C:03+1] es2054.yaml, instances.yaml: enable notifications, add es2054 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196632 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:05:38] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11280435 (10jijiki) 05Open→03In progress a:03thcipriani [09:05:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84017 and previous config saved to /var/cache/conftool/dbconfig/20251016-090541-root.json [09:05:47] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:06:01] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11280439 (10jijiki) 05Open→03In progress a:03thcipriani [09:06:34] (03PS2) 10Muehlenhoff: Enable the Prometheus exporter for the Ganeti CA on Ganeti masters [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) [09:08:26] (03PS3) 10Muehlenhoff: Enable the Prometheus exporter for the Ganeti CA on Ganeti masters [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) [09:13:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84018 and previous config saved to /var/cache/conftool/dbconfig/20251016-091343-root.json [09:14:16] 06SRE, 
10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11280447 (10thcipriani) a:05thcipriani→03None Approved from my side, sorry for delay. [09:14:28] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11280449 (10thcipriani) a:05thcipriani→03None Approved from my side, sorry for delay. [09:14:54] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ssw1-d1-eqiad.mgmt with reason: downtime ssw1-d1-eqiad until we have the monitoring checks fully working for the new platform [09:15:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11280451 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0e6fc9da-0f8b-4a56-b7a1-276d50744766) set by cmo... [09:19:44] (03PS1) 10MVernon: swift: re-add 3 codfw nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1196635 (https://phabricator.wikimedia.org/T400876) [09:20:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84019 and previous config saved to /var/cache/conftool/dbconfig/20251016-092047-root.json [09:20:52] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:21:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [09:22:22] (03CR) 10Marostegui: [C:03+1] swift: re-add 3 codfw nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1196635 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [09:24:58] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw 
rows A-D - https://phabricator.wikimedia.org/T354872#11280482 (10MatthewVernon) [09:26:04] (03CR) 10MVernon: [C:03+2] swift: re-add 3 codfw nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1196635 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [09:26:39] (03PS1) 10Marostegui: mariadb: Productionize db1263 [puppet] - 10https://gerrit.wikimedia.org/r/1196636 (https://phabricator.wikimedia.org/T406550) [09:27:36] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1263 [puppet] - 10https://gerrit.wikimedia.org/r/1196636 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [09:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:51] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196638 [09:30:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1260.eqiad.wmnet onto db1263.eqiad.wmnet [09:30:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1260 - Depool db1260.eqiad.wmnet to then clone it to db1263.eqiad.wmnet - marostegui@cumin1003 [09:31:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1260 - Depool db1260.eqiad.wmnet to then clone it to db1263.eqiad.wmnet - marostegui@cumin1003 [09:31:36] (03CR) 10Jelto: [C:03+1] "lgtm, although it would be better to focus on the CDN migration (or at least fix gitiles) rather than spending time on this workarounds" [puppet] - 10https://gerrit.wikimedia.org/r/1196568 (owner: 10Arnaudb) [09:31:47] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11280538 (10thcipriani) `deployment` access makes sense from my side: Approved. 
[09:32:44] (03CR) 10Federico Ceratto: [C:03+2] es2054.yaml, instances.yaml: enable notifications, add es2054 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196632 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:33:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11280542 (10thcipriani) `deployment` access makes sense from my side: Approved. [09:33:44] (03CR) 10Jelto: "We probably also want this setting for the test runners in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/p" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [09:34:25] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11280545 (10jijiki) 05Open→03Resolved a:03jijiki Added to nda and wmde ldap groups. [09:34:42] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (an-test-master1002), Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:35:04] oh [09:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:35:26] what's an-test-master1002 and why does it need backup? 
[09:35:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84021 and previous config saved to /var/cache/conftool/dbconfig/20251016-093553-root.json [09:35:58] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:36:02] (03PS1) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:36:58] (03PS2) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:40:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11280566 (10MoritzMuehlenhoff) We'll depool batches of servers which can be switched over. It totally depends on the VMs the nodes are running, fo... 
[09:41:32] (03PS1) 10Effie Mouzeli: data.yaml: add marialechnerwmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1196641 (https://phabricator.wikimedia.org/T406106) [09:41:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:42:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:53] (03CR) 10CI reject: [V:04-1] opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [09:44:20] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11280581 (10jijiki) a:03Volker_E [09:45:06] (03PS2) 10Effie Mouzeli: data.yaml: add marialechnerwmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1196641 (https://phabricator.wikimedia.org/T405917) [09:47:04] (03PS3) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:48:38] (03PS4) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:48:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for 
SKaram-WMF - https://phabricator.wikimedia.org/T407094#11280587 (10jijiki) Pinged user for out of band key verification [09:49:13] (03CR) 10Btullis: opensearch-operator: add the ability for the operator to watch several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [09:50:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11280590 (10Ladsgroup) >>! In T407094#11280587, @jijiki wrote: > Pinged user for out of band key verification Thanks! [09:50:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84022 and previous config saved to /var/cache/conftool/dbconfig/20251016-095058-root.json [09:51:03] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:52:17] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11280596 (10jijiki) pinged user for out of band key confirmation [09:55:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Add es2054 T402859', diff saved to https://phabricator.wikimedia.org/P84023 and previous config saved to /var/cache/conftool/dbconfig/20251016-095534-fceratto.json [09:55:39] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [09:55:39] (03CR) 10CI reject: [V:04-1] opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [09:56:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "I know. I wanted to add the ldap groups first and slowly add more access." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: 10Ladsgroup) [09:56:31] (03PS5) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:56:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1196641 (https://phabricator.wikimedia.org/T405917) (owner: 10Effie Mouzeli) [09:56:42] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2054.codfw.wmnet [09:56:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2054.codfw.wmnet [09:56:53] (03PS6) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:57:36] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2054 slowly with 10 steps - Pooling in new host [09:59:07] (03PS7) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [09:59:12] (03CR) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1000) [10:00:55] (03CR) 10FNegri: "What about putting this setting in hieradata/role/common/gitlab_runner.yaml? 
In this way it will apply to both devtools/gitlab-runner-XXXX" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [10:01:22] (03PS8) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [10:01:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11280620 (10MoritzMuehlenhoff) [10:01:50] (03CR) 10Majavah: "> In this way it will apply to both devtools/gitlab-runner-XXXX and gitlab-runners/runner-XXXX." [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [10:04:55] (03CR) 10FNegri: "Can you explain why not? My assumption was based on the fact that role::gitlab_runner is present in both sets of hosts, and nowhere else (" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [10:06:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84025 and previous config saved to /var/cache/conftool/dbconfig/20251016-100605-root.json [10:06:10] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [10:06:56] (03CR) 10Btullis: [C:03+1] opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:07:25] (03CR) 10Volans: Don't skip elasticsearch tests anymore on older py versions. 
(031 comment) [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195224 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:08:21] (03CR) 10CI reject: [V:04-1] opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:08:27] (03CR) 10Elukey: "I just realized this patch is for the debian branch, sigh" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195224 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:10:44] (03Abandoned) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [10:11:37] (03PS9) 10Brouberol: opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) [10:13:46] (03PS1) 10Cathal Mooney: ssw1-d1-eqiad: disable OSPF and BFD checks in Icinga [puppet] - 10https://gerrit.wikimedia.org/r/1196644 (https://phabricator.wikimedia.org/T405558) [10:14:53] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195211 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:15:05] !log installing libfcgi security updates [10:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:20] (03Abandoned) 10Elukey: Don't skip elasticsearch tests anymore on older py versions. 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195224 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:18:27] (03PS1) 10Esanders: LQT convert: Ignore duplicate key insert errors when command line flag set [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196645 (https://phabricator.wikimedia.org/T407357) [10:18:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196645 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [10:19:26] (03CR) 10Elukey: [C:03+1] ssw1-d1-eqiad: disable OSPF and BFD checks in Icinga [puppet] - 10https://gerrit.wikimedia.org/r/1196644 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [10:20:33] (03CR) 10Cathal Mooney: [C:03+2] ssw1-d1-eqiad: disable OSPF and BFD checks in Icinga [puppet] - 10https://gerrit.wikimedia.org/r/1196644 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [10:21:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1055 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84027 and previous config saved to /var/cache/conftool/dbconfig/20251016-102110-root.json [10:21:13] (03CR) 10Brouberol: [C:03+2] opensearch-operator: add the ability for the operator to watch several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196639 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:21:16] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [10:21:48] 
10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11280664 (10elukey) @Jhancock.wm I am a bit confused on what to do. What do you mean by wiping in this case? Is there anything to do in the BIOS menu or... [10:22:03] (03CR) 10Muehlenhoff: [C:03+2] url_downloader: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1193701 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [10:23:09] (03PS2) 10Elukey: setup.py: remove the elastic dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1195208 (https://phabricator.wikimedia.org/T390860) [10:24:23] (03CR) 10Elukey: [C:03+2] Remove the elasticsearch dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195211 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:26:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:26:01] (03PS1) 10Marostegui: control-mariadb-10.11-trixie: Fix version [software] - 10https://gerrit.wikimedia.org/r/1196646 (https://phabricator.wikimedia.org/T406981) [10:26:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:27:14] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-trixie: Fix version [software] - 10https://gerrit.wikimedia.org/r/1196646 (https://phabricator.wikimedia.org/T406981) (owner: 10Marostegui) [10:27:43] (03Merged) 10jenkins-bot: control-mariadb-10.11-trixie: Fix version [software] - 10https://gerrit.wikimedia.org/r/1196646 (https://phabricator.wikimedia.org/T406981) (owner: 10Marostegui) [10:28:43] (03CR) 10Elukey: [C:03+1] remote: Support timezone-aware objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) (owner: 10Majavah) [10:32:09] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1195208 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:34:49] (03CR) 10FNegri: "role::gitlab_runner is also present in bare metal runners (e.g. gitlab-runner1002.eqiad.wmnet), where we don't need to lower the MTU." [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [10:34:50] (03PS1) 10Brouberol: opensearch-operator: the leader election role should be installed in the operator ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196648 (https://phabricator.wikimedia.org/T404874) [10:35:25] (03CR) 10Btullis: [C:03+1] opensearch-operator: the leader election role should be installed in the operator ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196648 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:36:19] (03PS1) 10Muehlenhoff: Remove failoid role from failoid[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/1196649 (https://phabricator.wikimedia.org/T402406) [10:41:52] (03CR) 10Elukey: [C:03+2] setup.py: remove the elastic dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1195208 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [10:42:18] (03CR) 10Brouberol: [C:03+2] opensearch-operator: the leader election role should be 
installed in the operator ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196648 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [10:42:35] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11280711 (10BTullis) Thanks all for raising this ticket and for your kind feedback so far. I totally agree that: > `analytics-privatedata-users` is confusing for both applican... [10:43:31] (03CR) 10FNegri: "@jwodstrcil@wikimedia.org do we need to retain the different `log-driver` setting between gitlab-runner-1007 (`{log-driver: none}`) and gi" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [10:43:36] (03PS1) 10Ozge: feat: upgrades article quality buildkit 1.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196650 [10:43:59] (03PS2) 10Ozge: feat: upgrades article quality buildkit 1.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196650 (https://phabricator.wikimedia.org/T400446) [10:45:51] (03PS1) 10Marostegui: control-mariadb-client-10.6-bookworm: Remove duplicate [software] - 10https://gerrit.wikimedia.org/r/1196651 [10:46:31] (03CR) 10Marostegui: [C:03+2] control-mariadb-client-10.6-bookworm: Remove duplicate [software] - 10https://gerrit.wikimedia.org/r/1196651 (owner: 10Marostegui) [10:46:56] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bookworm: Remove duplicate [software] - 10https://gerrit.wikimedia.org/r/1196651 (owner: 10Marostegui) [10:49:50] jouncebot: nowandnext [10:49:50] For the next 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1000) [10:49:50] In 1 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1200) [10:50:02] (03PS1) 10Brouberol: opensearch-operator: bump version [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1196652 (https://phabricator.wikimedia.org/T404874) [10:53:17] !log hnowlan@deploy2002 Started deploy [restbase/deploy@0be0059]: deploy 9 new wikis from r/1177553 [10:57:50] (03CR) 10Muehlenhoff: [C:03+2] Remove failoid role from failoid[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/1196649 (https://phabricator.wikimedia.org/T402406) (owner: 10Muehlenhoff) [10:58:11] (03CR) 10Effie Mouzeli: [C:03+2] data.yaml: add marialechnerwmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1196641 (https://phabricator.wikimedia.org/T405917) (owner: 10Effie Mouzeli) [10:58:16] (03CR) 10Brouberol: [C:03+2] opensearch-operator: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196652 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [11:01:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:01:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:03:28] (03PS1) 10Marostegui: control-mariadb-client-10.11-trixie: Add to repo [software] - 10https://gerrit.wikimedia.org/r/1196656 (https://phabricator.wikimedia.org/T406981) [11:04:04] (03CR) 10Marostegui: [C:03+2] control-mariadb-client-10.11-trixie: Add to repo [software] - 10https://gerrit.wikimedia.org/r/1196656 (https://phabricator.wikimedia.org/T406981) (owner: 10Marostegui) [11:04:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:04:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:04:35] (03Merged) 10jenkins-bot: control-mariadb-client-10.11-trixie: Add to repo [software] - 10https://gerrit.wikimedia.org/r/1196656 (https://phabricator.wikimedia.org/T406981) (owner: 10Marostegui) [11:04:37] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[11:05:13] (03CR) 10Vgutierrez: [C:03+1] trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406599) (owner: 10Clément Goubert) [11:05:26] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:07:51] (03CR) 10Clément Goubert: [C:03+2] trafficserver: test2wiki action api to rest-gw [puppet] - 10https://gerrit.wikimedia.org/r/1196046 (https://phabricator.wikimedia.org/T406599) (owner: 10Clément Goubert) [11:08:33] !log sudo cumin 'A:cp' "disable-puppet 'Deploying gateway-check.lua changes - T406599 - cgoubert'" [11:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:37] T406599: Action API Test Plan & Execution for rest gateway rerouting - https://phabricator.wikimedia.org/T406599 [11:10:46] effie: merging your data.yaml change [11:12:42] !log installing Squid security updates [11:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:53] claime: cheers [11:14:09] I ran into moritz while merging it [11:15:08] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11280803 (10jijiki) 05Open→03Resolved This is sorted [11:16:37] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1196650 (https://phabricator.wikimedia.org/T400446) (owner: 10Ozge) [11:18:37] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add routes for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196047 (https://phabricator.wikimedia.org/T406324) (owner: 10Clément Goubert) [11:19:33] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@0be0059]: deploy 9 new wikis from r/1177553 (duration: 27m 01s) [11:20:31] (03Merged) 10jenkins-bot: rest-gateway: Add routes for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196047 (https://phabricator.wikimedia.org/T406324) (owner: 10Clément Goubert) [11:21:03] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:21:16] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:21:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:21:35] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:21:43] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:22:12] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:24:31] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11280823 (10LSobanski) Just a heads up that the alert fired again, can it be silenced for another month? 
[11:25:10] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484 (10LSobanski) 03NEW [11:26:33] !log sudo cumin 'A:cp' "enable-puppet 'Deploying gateway-check.lua changes - T406599 - cgoubert'" [11:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:37] T406599: Action API Test Plan & Execution for rest gateway rerouting - https://phabricator.wikimedia.org/T406599 [11:33:09] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [11:33:54] I can confirm wikitech-static is not loading for me [11:38:03] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 3.524 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:41:29] (03PS7) 10Clément Goubert: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [11:43:59] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11280902 (10Clement_Goubert) Done [11:44:55] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1196663 [11:47:05] (03PS3) 10Muehlenhoff: Remove Hiera option to disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) [11:48:26] (03CR) 10Kamila Součková: [C:03+1] wikikube: Add wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [11:50:20] (03CR) 10Ozge: [C:03+2] feat: upgrades article quality buildkit 1.x [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1196650 (https://phabricator.wikimedia.org/T400446) (owner: 10Ozge) [11:52:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: 10Muehlenhoff) [11:54:43] !log ozge@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:56:22] (03PS4) 10Muehlenhoff: Remove Hiera option to disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1200) [12:02:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: 10Muehlenhoff) [12:03:42] (03CR) 10Muehlenhoff: [C:03+2] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1196663 (owner: 10Muehlenhoff) [12:03:47] !log jmm@dns1004 START - running authdns-update [12:05:04] !log jmm@dns1004 END - running authdns-update [12:06:56] 06SRE, 06Infrastructure-Foundations, 10netops: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488 (10cmooney) 03NEW p:05Triage→03Low [12:12:09] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#11280973 (10BTullis) Hello again. It looks like the wikibase dumps performance issue described in {T389199} may have... 
[12:12:55] (03PS1) 10Ladsgroup: codesearch: Add apps service [puppet] - 10https://gerrit.wikimedia.org/r/1196670 (https://phabricator.wikimedia.org/T335407) [12:13:40] (03CR) 10CI reject: [V:04-1] codesearch: Add apps service [puppet] - 10https://gerrit.wikimedia.org/r/1196670 (https://phabricator.wikimedia.org/T335407) (owner: 10Ladsgroup) [12:13:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2054 slowly with 10 steps - Pooling in new host [12:15:48] (03PS2) 10Ladsgroup: codesearch: Add apps service [puppet] - 10https://gerrit.wikimedia.org/r/1196670 (https://phabricator.wikimedia.org/T335407) [12:16:42] (03CR) 10Ladsgroup: [V:03+2 C:03+2] codesearch: Add apps service [puppet] - 10https://gerrit.wikimedia.org/r/1196670 (https://phabricator.wikimedia.org/T335407) (owner: 10Ladsgroup) [12:18:19] (03PS1) 10Cathal Mooney: mgmt routers: allow ping from netflow* VMs to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/1196671 [12:20:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [homer/public] - 10https://gerrit.wikimedia.org/r/1196671 (owner: 10Cathal Mooney) [12:22:42] (03CR) 10Cathal Mooney: [C:03+2] mgmt routers: allow ping from netflow* VMs to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/1196671 (owner: 10Cathal Mooney) [12:23:55] (03Merged) 10jenkins-bot: mgmt routers: allow ping from netflow* VMs to mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/1196671 (owner: 10Cathal Mooney) [12:25:15] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196672 [12:36:39] !log installing gst-plugins-base1.0 security updates [12:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:12] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1026.eqiad.wmnet - https://phabricator.wikimedia.org/T407351#11281034 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:48:52] (03PS6) 10Ayounsi: [WIP] Analytics and 
loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 [12:49:49] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#11281067 (10Ladsgroup) Doing it temporarily and specially on the wikibase dumpers should be fineTM (we had a lot of i... [12:50:30] (03CR) 10Cathal Mooney: [C:03+2] [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [12:51:35] !log installing git security updates [12:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:52] (03CR) 10Elukey: [C:03+1] Remove Hiera option to disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: 10Muehlenhoff) [12:52:16] (03Merged) 10jenkins-bot: [WIP] Analytics and loopback ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1186958 (owner: 10Ayounsi) [12:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:54:03] (03CR) 10Arnaudb: [C:03+2] gerrit: tweak mod_qos ClientPrefer SrvMaxConnClose [puppet] - 10https://gerrit.wikimedia.org/r/1196568 (owner: 10Arnaudb) [12:59:52] (03CR) 10Arnaudb: [C:03+2] "+1 to this, unfortunately yesterday we were scraped to death a few times so it was necessary to go back to this workaround. I don't see m" [puppet] - 10https://gerrit.wikimedia.org/r/1196568 (owner: 10Arnaudb) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1300) [13:00:05] edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] o/ [13:00:14] (03CR) 10CDanis: [C:03+1] "thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196461 (https://phabricator.wikimedia.org/T406455) (owner: 10Muehlenhoff) [13:00:27] Will self deploy - looks like I'm the only one in the window [13:00:38] o/ [13:01:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196645 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [13:01:17] go ahead, I’m happy if I’m not the one to deploy that hack in RevisionStorage ;) [13:02:58] (03Merged) 10jenkins-bot: LQT convert: Ignore duplicate key insert errors when command line flag set [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196645 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [13:03:31] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1196645|LQT convert: Ignore duplicate key insert errors when command line flag set (T407357)]] [13:03:36] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [13:04:39] (03PS1) 10Neslihan Turan: Revert^2 "Add icons for wikibase changes. WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196681 [13:06:12] !log esanders@deploy2002 esanders: Backport for [[gerrit:1196645|LQT convert: Ignore duplicate key insert errors when command line flag set (T407357)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:07:08] (03PS1) 10Muehlenhoff: Split rpki-root into separate Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1196682 [13:07:32] (03PS2) 10Muehlenhoff: Split pki-root into separate Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1196682 [13:09:26] !log esanders@deploy2002 esanders: Continuing with sync [13:12:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning [13:12:41] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11281106 (10elukey) >>! In T381565#11278807, @TheDJ wrote: > i now see loadtimes that are around 3 seconds, so at least it seems better experience wise. Still not as fast as it has be... [13:13:45] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196645|LQT convert: Ignore duplicate key insert errors when command line flag set (T407357)]] (duration: 10m 14s) [13:13:50] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [13:16:44] Since it seems that nothing else was planned for the window, I would do a deploy [13:18:20] (03CR) 10Zabe: [C:03+2] BETA: Try using Hadoop QueryPage computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196109 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [13:19:41] (03Merged) 10jenkins-bot: BETA: Try using Hadoop QueryPage computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196109 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [13:20:04] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]] [13:20:09] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 [13:20:27] (03CR) 10Muehlenhoff: [C:03+2] jaeger: Add new IDP IP addressess [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1196461 (https://phabricator.wikimedia.org/T406455) (owner: 10Muehlenhoff) [13:22:27] !log zabe@deploy2002 zabe: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:00] !log zabe@deploy2002 zabe: Continuing with sync [13:27:57] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7008*} and A:cp [13:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:13] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]] (duration: 08m 09s) [13:28:17] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 [13:32:09] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [13:32:14] (03CR) 10Elukey: [C:03+1] Split pki-root into separate Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1196682 (owner: 10Muehlenhoff) [13:32:49] (03CR) 10Ssingh: [V:03+1] "I think this is ready to be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [13:33:10] (03PS2) 10Ssingh: url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) [13:33:46] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196686 [13:33:52] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7290/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [13:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:37:40] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196689 [13:41:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:42:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:25] (03CR) 10Muehlenhoff: [C:03+2] Split pki-root 
into separate Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1196682 (owner: 10Muehlenhoff) [13:49:01] PROBLEM - Host cp7008 is DOWN: PING CRITICAL - Packet loss = 100% [13:49:18] wut? [13:49:29] (03PS3) 10Arnaudb: gerrit: rsync and chown fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) [13:49:29] sukhe: ^^ expected? [13:49:31] !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki [13:49:47] not expected at all yeah [13:49:57] see https://phabricator.wikimedia.org/T407421 [13:49:58] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [13:50:01] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30042 bytes in 1.026 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [13:50:01] cp7007 also failed to come back up yesterday [13:50:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors [13:50:11] so we are having some fun in magru clearly [13:50:27] two nodes in the same cluster? 
[13:50:35] we might need to depool the DC [13:50:39] I mean, this was a reboot so it should have come back up [13:50:41] let me see the SEL [13:52:02] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7291/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:53:12] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: sync [13:53:42] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [13:55:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:06] huh [13:56:08] wow that [13:56:12] !incidents [13:56:12] 6871 (UNACKED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [13:56:16] eyes [13:56:17] !incidents [13:56:17] 6871 (UNACKED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [13:56:17] !ack 6871 [13:56:18] 6871 (ACKED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [13:56:30] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [13:56:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors [13:56:36] er, esams and drmrs v6? 
[13:56:45] here [13:56:48] <_joe_> let me check [13:57:22] <_joe_> i'd check NELs [13:57:44] liberica has been complaining about cp servers [13:57:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning [13:57:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1260.eqiad.wmnet onto db1263.eqiad.wmnet [13:57:59] so not NEL [13:58:08] not seeing increase in 50X [13:58:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:05] !log sudo ipmitool -I lanplus -H "cp7008.mgmt.magru.wmnet" -U root -E chassis power cycle [13:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:39] it has already recovered. 
I'm not seeing anything in traffic patterns so far [14:00:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:15] RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 110.43 ms [14:04:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.pki.restart-reboot (exit_code=0) rolling reboot on A:pki [14:05:17] PROBLEM - haproxy process on cp7008 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [14:06:49] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:06:49] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:07:15] resolving ^ [14:07:49] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7008 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-12-19 12:29:31 +0000 (expires in 63 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:07:49] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7008 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2025-11-14 05:58:19 +0000 (expires in 28 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:08:17] RECOVERY - haproxy process on cp7008 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [14:08:20] (03PS1) 10Federico Ceratto: preseed.yaml, es2055.yaml, site.pp: Prepare es2055 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1196696 
(https://phabricator.wikimedia.org/T402859) [14:09:43] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7008.magru.wmnet [14:09:43] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7008*} and A:cp [14:13:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:30] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7292/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:17:24] !log bump space for prometheus k8s-dse in eqiad [14:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:17:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11281479 (10Jhancock.wm) i'm honestly not sure if it's a data retention safe guard or a limitation of the hardware. but if you configure disks in bios mo... 
[14:18:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:22] (03CR) 10Filippo Giunchedi: [C:03+2] "I'll go ahead, easy enough to change later" [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:21:40] (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: introduce cloud_storage_subnet variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:23:25] RESOLVED: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:09] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11281504 (10MoritzMuehlenhoff) [14:27:53] !log starting `removenode` of aqs1012-a (id=0b0f0cd5-a1f8-44e2-a8e2-75800ebaea80) — T407414 [14:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:57] T407414: aqs1012 is down - https://phabricator.wikimedia.org/T407414 [14:29:19] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:04] !log installing distro-info-data updates on Bookworm [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1430) [14:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:47] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [14:33:23] !log starting `removenode` of 
aqs1012-b (id=bc700f01-8120-4d77-908f-eea943470a25)— T407414 [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:28] T407414: aqs1012 is down - https://phabricator.wikimedia.org/T407414 [14:34:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11281519 (10MoritzMuehlenhoff) [14:37:45] !log installing libarchive security updates [14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:46:23] 06SRE, 07Documentation: The links under "Test IP fragmentation issues" on `wikitech:Reporting a connectivity issue` no longer appear to work - https://phabricator.wikimedia.org/T407505 (10A_smart_kitten) 03NEW [14:47:15] (03PS1) 10Scott French: php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196701 [14:47:27] (03PS1) 10Arnaudb: gerrit: stop puppet across all instances [cookbooks] - 10https://gerrit.wikimedia.org/r/1196694 (https://phabricator.wikimedia.org/T407200) [14:47:35] (03PS1) 10Arnaudb: gerrit: stop stopping gerrit.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1196695 (https://phabricator.wikimedia.org/T387833) [14:48:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:09] (03PS1) 10Cparle: Enable Special:Watchlist pagination on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) [14:51:06] (03PS2) 10Cparle: Enable Special:EditWatchlist 
pagination on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) [14:54:18] !log jhancock@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7007'] [14:55:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [14:56:50] (03CR) 10Clément Goubert: [C:03+1] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196701 (owner: 10Scott French) [14:57:04] (03PS26) 10Herron: thanos-rule: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) [14:57:04] (03CR) 10Herron: [V:03+1] "yes sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [14:57:16] (03PS4) 10CDanis: haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 [14:58:16] (03PS1) 10Cathal Mooney: Add Nokia devices to common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1196704 (https://phabricator.wikimedia.org/T405558) [14:59:16] (03PS2) 10Cathal Mooney: Add Nokia devices to common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1196704 (https://phabricator.wikimedia.org/T405558) [14:59:17] (03CR) 10Dr0ptp4kt: "Clarifying it's `/instrument/.*`, not `/instruments/.*` (as in line with the regex from the patch, but mistakenly typed into a comment! 
so" [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:00:05] hashar and jnuche: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1500) [15:00:13] (03CR) 10BCornwall: [C:03+2] mediawiki/httpbb: Add 25.wikipedia.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: 10BCornwall) [15:00:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11281615 (10Jhancock.wm) @elukey hey how do i reimage with Debian Trixie. That seems different than running the reimage cookbook. [15:02:19] (03CR) 10BCornwall: [C:03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: 10BCornwall) [15:03:33] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7007'] [15:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:05] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7008.magru.wmnet [reason: updating firmware] [15:10:46] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp7007.magru.wmnet [15:10:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7007.magru.wmnet [15:10:50] !log jhancock@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7008'] [15:13:19] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:36] (03CR) 10Elukey: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:13:42] (03CR) 10Elukey: [C:03+1] Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:14:19] PROBLEM - Host cp7008 is DOWN: PING CRITICAL - Packet loss = 100% [15:14:34] ^ "expected" [15:15:10] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7008.magru.wmnet with reason: firmware upgrade [15:16:33] FIRING: [2x] KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:16:46] this is me --^ [15:16:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11281658 (10elukey) @Jhancock.wm `--os trixie` is sufficient. Did you encounter any issue while doing 2045? [15:17:11] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:20:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11281667 (10ssingh) FWIW doing one or two hosts is more than enough. We will reimage them again anyway so it doesn't make sense IMO for you both to spend time u... 
[15:20:49] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7008'] [15:22:06] 10ops-magru, 06DC-Ops, 06Traffic: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421#11281670 (10Jhancock.wm) shorthand from irc chat. Riser is a card between a NIC or PERC and the main board. for space saving reasons. the bus errors are probably the riser and whatever is plugged... [15:22:27] RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 110.54 ms [15:22:55] 10ops-magru, 06DC-Ops, 06Traffic: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421#11281672 (10ssingh) 05Open→03Resolved Thanks to @Jhancock.wm for the help with this! [15:25:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11281678 (10BCornwall) Same here. Feel free to plop something on my calendar! 
[15:27:32] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp7008.magru.wmnet [15:27:32] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7008.magru.wmnet [15:28:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:28:56] Here ish [15:29:02] me too [15:29:05] !incidents [15:29:06] 6875 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:29:06] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [15:29:09] Need a minute to find a table [15:29:15] !ack 6875 [15:29:15] 6875 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:33:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:33:49] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp700[7-8].magru.wmnet [reason: pool after firmware updated] [15:34:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:34:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:33] (03PS5) 10CDanis: haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 [15:35:05] PROBLEM - Check unit status of 
httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:35:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:37:11] FIRING: [12x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:31] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml, es2055.yaml, site.pp: Prepare es2055 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1196696 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:37:46] (03CR) 10Ssingh: dnsrecursor: use config dir instead of standalone file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [15:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:39:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:40:44] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml, es2055.yaml, site.pp: Prepare es2055 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1196696 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:40:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:42:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2055.codfw.wmnet with reason: Setting up new ES host [15:42:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:43:12] (03PS6) 10CDanis: haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 [15:43:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:44:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:44:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:45:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:45:55] (03CR) 10Vgutierrez: [C:03+1] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 (owner: 10CDanis) [15:46:47] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:47:30] (03CR) 10CDanis: [C:03+2] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 (owner: 10CDanis) [15:49:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:51:33] FIRING: [2x] KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:51:43] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:52:15] (03PS1) 10CDanis: haproxy: early drop: try again [puppet] - 10https://gerrit.wikimedia.org/r/1196713 [15:52:43] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:52:45] FIRING: Traffic on tunnel link: Alert for device cr2-eqdfw.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [15:53:12] (03PS1) 10Cathal Mooney: gnmic: add collection for 
Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) [15:53:14] (03CR) 10Vgutierrez: [C:03+1] haproxy: early drop: try again [puppet] - 10https://gerrit.wikimedia.org/r/1196713 (owner: 10CDanis) [15:53:31] (03PS2) 10CDanis: haproxy: early drop: try again [puppet] - 10https://gerrit.wikimedia.org/r/1196713 [15:53:36] (03CR) 10CDanis: [C:03+2] haproxy: early drop: try again [puppet] - 10https://gerrit.wikimedia.org/r/1196713 (owner: 10CDanis) [15:53:46] (03CR) 10CI reject: [V:04-1] gnmic: add collection for Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [15:54:19] (03CR) 10CDanis: [V:03+2 C:03+2] haproxy: early drop: try again [puppet] - 10https://gerrit.wikimedia.org/r/1196713 (owner: 10CDanis) [15:54:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:56:28] jouncebot: nowandnext [15:56:28] For the next 0 hour(s) and 3 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1500) [15:56:28] In 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1600) [15:56:34] I'll charlie out some envoy updates if there's nothing ongoing for a bit [15:56:37] (03PS2) 10Cathal Mooney: gnmic: add collection for Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) [15:56:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:57:07] (03CR) 10CI reject: [V:04-1] gnmic: add collection for Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [15:57:09] (03PS6) 10Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) [15:57:33] (03CR) 10CI reject: [V:04-1] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [15:57:45] RESOLVED: Traffic on tunnel link: Device cr2-eqdfw.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [15:59:40] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:00:01] (03PS3) 10Cathal Mooney: gnmic: add collection for Nokia OSPF states [puppet] - 10https://gerrit.wikimedia.org/r/1196714 (https://phabricator.wikimedia.org/T405558) [16:00:05] jhathaway and moritzm: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:13] (03PS7) 10Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) [16:01:03] 10ops-eqiad, 06DC-Ops: Inbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T407510 (10phaultfinder) 03NEW [16:01:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:02:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:04:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11281808 (10Jhancock.wm) @ssingh 2043 and 2044 have been reimaged. so it's all yours! @elukey i spaced we have a new os lol. i tried to do bullseye per the ori... [16:05:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11281812 (10VRiley-WMF) [16:07:11] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:28] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196631 (https://phabricator.wikimedia.org/T406000) (owner: 10Joal) [16:08:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:09:14] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513 (10MatthewVernon) 03NEW [16:10:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:10:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:11:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:13:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2032.codfw.wmnet onto es2055.codfw.wmnet [16:13:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2032 - Depool es2032.codfw.wmnet to then clone it to es2055.codfw.wmnet - fceratto@cumin1003 [16:14:17] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Depool es2032.codfw.wmnet to then clone it to es2055.codfw.wmnet - fceratto@cumin1003 [16:15:35] 06SRE, 06Infrastructure-Foundations, 10netops: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11281891 (10Papaul) I do agree with you that we should have redundancy link to another switch. 
I have been thinking also for long term on the mgmt network design if we will h... [16:15:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11281893 (10Jhancock.wm) @Jgreen we got these drives in. I can install the one in franio2003 when i install the server. As for the other two, can i install the disks... [16:17:17] fceratto@cumin1003 clone_es (PID 3896502) is awaiting input [16:18:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:19:16] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: remove lvs1018 enp94s0f0np0 link to rack E1 [16:19:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11281905 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90f044fa-0459-4db3-89e0-7542b1906768) set by cmooney@cumin1003 for 2:00:... 
[16:20:17] !log disable BGP sessions for lvs1018 on cr1-eqiad, cr2-eqiad to move traffic to backup load-balancer lvs1020 T405499 [16:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:21] T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499 [16:23:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:31:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11281964 (10Ahoelzl) Approved, both analytics-admins and deployment. [16:34:23] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:35:22] (03CR) 10Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [16:37:11] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:59] (belatedly, sorry - not actually doing that 
envoy deployment right now after all) [16:44:23] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:46] (03CR) 10Cathal Mooney: [C:03+2] lvs1018: remove L2 sub-interface config for row E/F vlans [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [16:46:02] !log reprepro include php8.3_8.3.26-1+wmf11u2 in component/php83 [16:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:53] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:38] (03CR) 10Scott French: [V:03+2] "Built locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196701 (owner: 10Scott French) [16:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:56:23] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [16:56:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11282052 (10ops-monitoring-bot) Host lvs1018.eqiad.wmnet rebooted by brett@cumin2002 with reason: None [16:57:22] !log mforns@deploy2002 Started deploy [analytics/refinery@6b7edca] (hadoop-test): Regular analytics 
weekly train TEST [analytics/refinery@6b7edcac] [16:58:38] !log mforns@deploy2002 Finished deploy [analytics/refinery@6b7edca] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6b7edcac] (duration: 01m 16s) [16:59:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [16:59:47] PROBLEM - Host lvs1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:49] RECOVERY - Host lvs1018 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:00:03] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1700) [17:00:06] swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1700) [17:00:27] o/ [17:00:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:00:47] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:00:57] (03CR) 10Scott French: [V:03+2] "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196701 (owner: 10Scott French) [17:00:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11282060 (10cmooney) Sorry for the run around guys, looking at the schedule I think it'll h... 
[17:00:59] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196701 (owner: 10Scott French) [17:01:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:01:18] ^lvs1018 errors are expected [17:01:32] well, not really expected but they're not an issue atm [17:02:23] thanks, b.rett! [17:03:27] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [17:04:47] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:05:03] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:05:23] (03CR) 10Vgutierrez: [C:03+1] haproxy: add JA4H support [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [17:06:29] FYI, I'll be kicking off a scap deployment shortly. given that it will be picking up a new PHP production image, it should take ~ 30 minutes to complete. 
[17:06:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [17:06:43] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11282101 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm [17:08:27] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 18 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [17:09:13] !log mforns@deploy2002 Started deploy [analytics/refinery@6b7edca]: Regular analytics weekly train [analytics/refinery@6b7edcac] [17:09:24] oh interesting ... [17:09:48] brett: your https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196141 will now be deployed to mediawiki. is that good to go? [17:09:56] swfrench-wmf: Yes, thank you! [17:10:48] * swfrench-wmf now realizes this is why httpbb checks are failing [17:10:49] !log re-enable BGP sessions for lvs1018 on cr1-eqiad, cr2-eqiad after maintenance on the lvs host T405499 [17:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:53] T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499 [17:10:58] brett: ack, thanks for confirming - doing [17:12:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:12:10] nothing for me to deploy during my window this week. 
[17:12:31] !log swfrench@deploy2002 Started scap sync-world: New PHP 8.3 production image [17:12:55] ^ httpbb checks will recover when this deployment completes [17:16:01] !log mforns@deploy2002 Finished deploy [analytics/refinery@6b7edca]: Regular analytics weekly train [analytics/refinery@6b7edcac] (duration: 06m 48s) [17:16:18] !log mforns@deploy2002 Started deploy [analytics/refinery@6b7edca] (thin): Regular analytics weekly train THIN [analytics/refinery@6b7edcac] [17:17:47] !log mforns@deploy2002 Finished deploy [analytics/refinery@6b7edca] (thin): Regular analytics weekly train THIN [analytics/refinery@6b7edcac] (duration: 01m 29s) [17:22:21] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:22:58] (03CR) 10Btullis: [C:03+2] Update sqoop for mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/1196631 (https://phabricator.wikimedia.org/T406000) (owner: 10Joal) [17:23:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11282183 (10cmooney) @Jclark-ctr looking at the timetable this would mean moving the ASW li... 
[17:24:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:24:22] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:24:23] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [17:24:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11282188 (10cmooney) 05Open→03Resolved a:03cmooney Ok all works completed and things looking good. I'll close this task and advise DC-Ops... 
[17:26:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11282214 (10Jclark-ctr) That works for me [17:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:28:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [17:29:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:29:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.937s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:29:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:54] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11282237 (10cmooney) @VRiley-WMF hey just to let you know we finished T405499 this evening. So as described above we can now re-use that [[ https://netbox.wikimedia.org/dcim/interfaces/34975/... 
[17:33:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:34:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.937s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:34:23] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:35:33] (03PS1) 10CDanis: haproxy: silent-drop: lower limit [puppet] - 10https://gerrit.wikimedia.org/r/1196723 [17:37:11] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:47] !log 
swfrench@deploy2002 Finished scap sync-world: New PHP 8.3 production image (duration: 27m 32s) [17:39:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.652s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:41:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:44:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.469s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:46:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:47:11] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:27] jouncebot: nowandnext [17:47:27] For the next 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly 
deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1700) [17:47:27] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T1700) [17:47:27] In 2 hour(s) and 12 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T2000) [17:50:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:52:09] (03CR) 10HMonroy: [C:03+1] Enable Special:EditWatchlist pagination on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [17:54:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [17:54:49] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11282370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm completed: - sretest2003 (**WARN**)... 
[17:56:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:56:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:58:15] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11282408 (10Jhancock.wm) @Marostegui you have to root into the server and view the console to see what the installer is doing. it had an issue with the drives. expected since it needed t... [18:01:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:01:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:02:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:02:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:04:23] FIRING: [17x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:40] 
RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:08:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:08:40] !log Import varnish 7.1.1-2~bpo13+wmf1 into trixie-wikimedia - T401832 [18:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:48] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [18:10:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:12:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:14:23] FIRING: [16x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:17:51] RESOLVED: CoreRouterInterfaceDown: Core router interface 
down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:17:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:18:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:18:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:24:33] (03CR) 10BCornwall: [C:03+2] wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:25:10] !log brett@dns1004 START - running authdns-update [18:26:26] !log brett@dns1004 END - running authdns-update [18:26:37] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:26:43] (03CR) 10BCornwall: [C:03+2] Remove wikimedia.support from ncredir/acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:27:22] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:27:43] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:28:00] PROBLEM - Host wikikube-worker2095 is DOWN: CRITICAL - Time to live exceeded (10.192.14.10) [18:28:03] PROBLEM - Host db2153 #page is DOWN: CRITICAL - Time to live exceeded (10.192.0.4) [18:28:03] PROBLEM - Host db2199 is DOWN: CRITICAL - Time to live exceeded (10.192.6.8) [18:28:03] PROBLEM - Host conf2004 is DOWN: CRITICAL - Time to live exceeded (10.192.16.45) [18:28:05] PROBLEM - Host db2187 #page is DOWN: CRITICAL - Time to live exceeded (10.192.48.208) [18:28:15] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:19] RECOVERY - Host db2153 #page is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [18:28:20] RECOVERY - Host wikikube-worker2095 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [18:28:23] RECOVERY - Host db2187 #page is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [18:28:23] RECOVERY - Host db2199 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [18:28:23] RECOVERY - Host conf2004 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:28:31] that's an interesting one [18:28:32] PROBLEM - Host wikikube-worker2203 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:32] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:33] !incidents [18:28:34] 6878 (RESOLVED) Host db2187 (paged) [18:28:34] hmmm [18:28:34] 6877 (RESOLVED) Host db2153 (paged) [18:28:34] 6875 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [18:28:34] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [18:28:35] (03PS2) 10BCornwall: Remove wikimedia.support from ncredir/acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952) [18:28:37] network event or racks? 
[18:28:38] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 231.16 ms [18:28:47] monitoring most likely given the different sites [18:28:47] Very fun day [18:28:51] TTL exceeded .. in internal network? [18:29:03] !incidents [18:29:03] 6878 (RESOLVED) Host db2187 (paged) [18:29:03] 6877 (RESOLVED) Host db2153 (paged) [18:29:03] 6875 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [18:29:04] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [18:29:08] fwiw, postfix on crm2001 looks fine [18:29:50] other hosts look fine as well [18:29:59] * swfrench-wmf nods [18:30:00] Resolved on its own [18:30:03] don't seem to have actually gone down so I am going to attribute it to monitoring [18:30:11] but what happened here, we should check that probably [18:30:58] it started with PROBLEM - OSPF status on cr1-eqiad is CRITICAL: [18:30:59] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:31:09] mutante: yes [18:32:34] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:32:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:33:02] alert1002 looks //slightly// unhappy during this period but nothing too bad? 
[18:33:05] https://grafana.wikimedia.org/goto/xz36rweHg?orgId=1 [18:33:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:33:31] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:34:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:22] sukhe: I think the TCP and socket errors on alert1002 could also be related to the OSPF adjacency.. [18:35:51] they are all in codfw - but its not all the same rack. 
checked in netbox [18:36:14] except doh5001 does not match the pattern at all [18:37:09] mutante: probably the source of the monitoring traffic is in eqiad [18:37:15] mutante: alert1002 is the primary monitoring host [18:37:29] FIRING: [11x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:58] https://librenms.wikimedia.org/device/device=1/tab=port/port=28260/ https://librenms.wikimedia.org/device/device=92/tab=port/port=28396/ this transport link is showing some errors on both sides [18:38:11] but has been for a few hours [18:38:50] cdanis: yes it is in eqiad and it does not have a "remote prober" as prometheus [18:39:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:29] what tappof says would explain that the alert on the core router happened right then. if it was only the alerting server that could explain all others but not the cr alert.. since that actually said one OSPFv3 was down (5/5 vs 6/6) ? [18:40:43] PROBLEM - Host doh4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:48] ha [18:40:58] PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:58] PROBLEM - Host logstash2031 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:59] tappof: can you take a look at alert1002 to see if anything stands out? 
[18:41:02] PROBLEM - Host doh4002 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:10] RECOVERY - Host doh4002 is UP: PING OK - Packet loss = 0%, RTA = 72.75 ms [18:41:14] RECOVERY - Host logstash2031 is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms [18:41:15] none of these hosts are down [18:41:15] yeah [18:41:16] RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [18:41:16] RECOVERY - Host doh4001 is UP: PING OK - Packet loss = 0%, RTA = 71.48 ms [18:41:20] PROBLEM - Host wikikube-worker2203 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:20] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:37] the icinga logstash dashboard doesn't seem to work anymore [18:41:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:41:52] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [18:41:54] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2001.codfw.wmnet, dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:41:59] fun [18:42:04] so, the ~ 18:28 event looks a lot like something flapped [18:42:06] sukhe: yeah, sure [18:42:39] and it just happened again [18:42:42] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2032.codfw.wmnet onto es2055.codfw.wmnet [18:43:04] I can't see anything in librenms but I am sure I am not looking at the right place [18:43:17] hitting cross-DC rest-gateway -> mw-api-ext traffic pretty hard [18:43:18] that 
gerrit alert is not new.. it's more like Icinga caught up with something [18:43:31] https://grafana.wikimedia.org/goto/Li5ajQeNR?orgId=1 [18:44:05] we're serving 1krps of errors from the cdn? [18:44:11] cdanis: yes [18:44:19] on each one of these flaps [18:44:45] swfrench-wmf: from esams and drmrs [18:44:50] and magru and eqiad [18:44:57] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit2003.wikimedia.org with reason: no active host - disabled [18:45:11] cdanis: exactly, yeah - things that will hit rest-gateway in eqiad [18:45:22] and then have rest-gateway have to go cross-DC to get to mw-api-ext [18:45:37] yep [18:48:03] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:48:47] swfrench-wmf: https://grafana.wikimedia.org/d/f87546b4-df0f-4c2a-8656-8658460a4586/network-bgp-overview?from=now-1h&orgId=1&timezone=utc&to=now&viewPanel=panel-880 [18:50:51] the icinga log seems to rotate daily. there are 3776 "Socket timeout"s [18:53:20] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:47] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:53:50] cdanis: oh, interesting! trying to figure out how to read this ... does this show adjacencies where the number of sessions in establishes is less than expected? [18:54:00] *established [18:54:09] swfrench-wmf: something like that I think. 
having data points on that graph == badness [18:54:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:55:20] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:55:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:55:54] Yeah, mutante, but most of the timeouts start around Thu Oct 16 06:40:07 PM UTC 2025. [18:57:55] !log andrew@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cloudbackup1002-dev.eqiad.wmnet [18:58:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:44] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:03:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:04:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:44] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv6 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935539 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:06:39] RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:06:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:08:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:10:22] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:10:44] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv6 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935539 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:10:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:11:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:11:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:06] (CR) Dzahn: [C:+1] "has approvals now and lgtm" [puppet] - https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) (owner: JavierMonton) [19:15:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:16:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:16:55] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:19:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:44] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv6 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935539 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:21:40] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:21:55] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:22:04] (CR) JHathaway: [C:+1] Remove Hiera option to disable agent forwarding [puppet] - https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: Muehlenhoff) [19:22:15] !log dancy@deploy2002 Installing scap version "4.214.0" for 2 host(s) [19:23:18] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:25:02] !log dancy@deploy2002 Installation of scap version "4.214.0" completed for 2 hosts [19:25:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:06] (CR) Aaron Schulz: "One question, others LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: BPirkle) [19:25:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) (owner: NMW03) [19:25:44] RESOLVED: [8x] RipeAtlasAnchorUnreachable: ipv6 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935539 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:26:40] FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:31:40] FIRING: [9x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:31:55] FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:36:40] RESOLVED: [10x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:38:04] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:38:59] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS 
trixie [19:41:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) (owner: Ebernhardson) [19:41:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) (owner: Ebernhardson) [19:41:56] (PS2) Ebernhardson: Revert "cirrus: Start AB test of did-you-mean profiles" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) [19:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:46:55] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:49:14] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:50:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:51:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:51:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:53:54] (CR) Andrew Bogott: [C:+1] cloudceph: handle double / single NIC transition [puppet] - https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi) [19:54:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196531 (https://phabricator.wikimedia.org/T407281) (owner: Hamish) [19:56:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T2000). 
[20:00:05] Nemoralis, ebernhardson, and hamishcz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] :) im here [20:01:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:01:27] \o [20:03:24] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:03:45] i suppose i can do the deployment [20:03:51] Nemoralis: around for deploy? [20:04:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:05] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) (owner: Ebernhardson) [20:05:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:54] (Merged) jenkins-bot: Revert "cirrus: Start AB test of did-you-mean profiles" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196127 (https://phabricator.wikimedia.org/T390858) (owner: Ebernhardson) [20:06:13] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]] [20:06:17] T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content - https://phabricator.wikimedia.org/T390858 [20:06:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:07:55] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:08:28] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:10:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:10:56] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:11:29] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [20:11:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:14:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:15:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:15:49] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]] (duration: 09m 36s) [20:15:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:15:54] T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content - https://phabricator.wikimedia.org/T390858 [20:16:40] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:17:24] ebernhardson: hi, would u help me with mine please? 
[20:17:42] hamishcz: yup, can ship it now [20:18:11] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196531 (https://phabricator.wikimedia.org/T407281) (owner: Hamish) [20:18:30] PROBLEM - Host ms-be2068 is DOWN: CRITICAL - Time to live exceeded (10.192.32.91) [20:18:30] PROBLEM - Host wikikube-worker2063 is DOWN: CRITICAL - Time to live exceeded (10.192.5.26) [20:18:44] RECOVERY - Host ms-be2068 is UP: PING WARNING - Packet loss = 71%, RTA = 33.96 ms [20:18:44] RECOVERY - Host wikikube-worker2063 is UP: PING WARNING - Packet loss = 71%, RTA = 33.25 ms [20:18:55] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [20:19:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [20:19:42] (Merged) jenkins-bot: Create "autopatrolled" user group on Danish Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1196531 (https://phabricator.wikimedia.org/T407281) (owner: Hamish) [20:19:59] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1196531|Create "autopatrolled" user group on Danish Wikisource (T407281)]] [20:20:03] T407281: Create "autopatrolled" user group on Danish Wikisource - https://phabricator.wikimedia.org/T407281 [20:20:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:21:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:21:24] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:21:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:21:56] ebernhardson: I am here [20:22:05] sorry, I was AFK [20:22:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:22:55] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:24:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:24:56] !log ebernhardson@deploy2002 ebernhardson, hamishz: Backport for [[gerrit:1196531|Create "autopatrolled" user group on Danish Wikisource (T407281)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:25:55] hamishcz: can you check that it works? 
[20:26:06] my quick review of Special:UserGroupRights suggests it probably does [20:26:18] yea i checked and LGTM [20:26:28] !log ebernhardson@deploy2002 ebernhardson, hamishz: Continuing with sync [20:26:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:30:40] (CR) Dzahn: [C:+2] gerrit: disable gerrit service to enable backups [puppet] - https://gerrit.wikimedia.org/r/1196629 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [20:30:56] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196531|Create "autopatrolled" user group on Danish Wikisource (T407281)]] (duration: 10m 57s) [20:31:00] T407281: Create "autopatrolled" user group on Danish Wikisource - https://phabricator.wikimedia.org/T407281 [20:31:22] Nemoralis: are you around for deployment? [20:31:26] yes [20:31:33] awesome, you're next [20:32:19] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) (owner: NMW03) [20:32:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:33:11] (Merged) jenkins-bot: Add wgSitename for azwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) (owner: NMW03) [20:33:31] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1196460|Add wgSitename for azwiktionary (T407358)]] [20:33:35] T407358: set $wgSitename for the azerbaijani wiktionary - https://phabricator.wikimedia.org/T407358 [20:33:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:35:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:37:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:37:58] (CR) Dzahn: "What I can confirm is that using --chown with rsync has not really been working and doing rsync first and then "find .. -exec chown / chmo" [cookbooks] - https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [20:37:59] PROBLEM - Auth DNS #page on ns1-v4 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:38:09] !log ebernhardson@deploy2002 ebernhardson, nmw03: Backport for [[gerrit:1196460|Add wgSitename for azwiktionary (T407358)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:24] tested, LGTM [20:38:28] RESOLVED: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:38:34] Nemoralis: awesome [20:38:36] !log ebernhardson@deploy2002 ebernhardson, nmw03: Continuing with sync [20:38:52] !incidents [20:38:52] You're not allowed to perform this action. 
[20:38:57] RECOVERY - Auth DNS #page on ns1-v4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:38:58] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:39:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:39:43] !incidents [20:39:43] 6879 (RESOLVED) ns1-v4/Auth DNS (paged) [20:39:43] 6878 (RESOLVED) Host db2187 (paged) [20:39:43] 6877 (RESOLVED) Host db2153 (paged) [20:39:44] 6875 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [20:39:44] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [20:40:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:40:16] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [20:40:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:43:00] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196460|Add wgSitename for azwiktionary (T407358)]] (duration: 09m 29s) [20:43:04] T407358: set $wgSitename for the azerbaijani wiktionary - https://phabricator.wikimedia.org/T407358 [20:43:43] deployment window complete [20:43:55] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:43:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [20:44:25] thanks! [20:44:33] np [20:46:22] (CR) Dzahn: "I found the notes from a switchover in the past:" [cookbooks] - https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [20:47:26] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:49:11] (CR) Dzahn: [C:+1] "https://phabricator.wikimedia.org/T387833#11283327" [cookbooks] - https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [20:50:26] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:50:55] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:51:06] !log disabling cr1-eqiad:et-1/1/2 and cr1-codfw:et-1/0/2 (both ends of same Arelion transport, been erroring/flapping for a while) [20:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:52:52] (PS1) BCornwall: cdn.roll-reboot: Run puppet as post_action [cookbooks] - https://gerrit.wikimedia.org/r/1196756 [20:53:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:55:11] 
(CR) Dzahn: [C:+1] gerrit: stop puppet across all instances [cookbooks] - https://gerrit.wikimedia.org/r/1196694 (https://phabricator.wikimedia.org/T407200) (owner: Arnaudb) [20:55:55] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:56:16] (CR) Dzahn: [C:+1] "curious: did you need any other step after merging this / any coordination with serviceops ? or did it just work after merging and waitin" [puppet] - https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: BCornwall) [20:56:55] !log see also https://phabricator.wikimedia.org/T407578 for above port disables [20:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:11] (CR) BCornwall: [C:+2] "swfrench was kind enough to deploy it for me - but I did just plan on waiting. He was just very proactive in getting it out :)" [puppet] - https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: BCornwall) [20:58:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.968s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:58:45] (CR) Dzahn: [C:+1] "heh, I see! I have always wondered this about changes to redirects.dat. Still not sure what the deployment involves but cool." [puppet] - https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: BCornwall) [20:59:48] (CR) Dzahn: "gotcha. 
makes sense since there were also 2 tickets" [puppet] - https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: Ladsgroup) [20:59:59] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.* [21:00:00] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: Debugging sre.cdn.roll-reboot bugs [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251016T2100) [21:02:04] o/ is anyone in the process of doing any deployments? If so please speak now :) [21:03:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.477s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:03:34] (PS1) Jdlrobson: Temporary user banner should not have such a high z-index [skins/MinervaNeue] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196760 (https://phabricator.wikimedia.org/T407549) [21:03:49] okay proceeding now! 
[21:04:10] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196760 (https://phabricator.wikimedia.org/T407549) (owner: Jdlrobson) [21:04:46] (CR) Dzahn: [C:+1] "@FNegri If we want it to apply to "any instance in cloud VPS, regardless of project" then it can be in hieradata/cloud.yaml or it can be d" [puppet] - https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: FNegri) [21:08:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7004.magru.wmnet} and A:cp [21:09:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:14:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:20:28] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7004.magru.wmnet [21:20:28] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7004.magru.wmnet} and A:cp [21:22:19] ebernhardson: a late thanks [21:22:40] (PS2) BPirkle: Enable REST Sandbox on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1196492 
(https://phabricator.wikimedia.org/T389409)
[21:23:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[21:24:40] !incidents
[21:24:40] You're not allowed to perform this action.
[21:25:08] jasmine_: I did not add the underscore
[21:25:21] do you always have that ?
[21:26:19] !incidents
[21:26:19] 6879 (RESOLVED) ns1-v4/Auth DNS (paged)
[21:26:20] 6878 (RESOLVED) Host db2187 (paged)
[21:26:20] 6877 (RESOLVED) Host db2153 (paged)
[21:26:20] 6875 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out)
[21:26:20] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6)
[21:26:24] \o/
[21:26:33] cool
[21:26:33] ah nice, ty mutante
[21:26:34] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7004.*
[21:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:30:53] (Merged) jenkins-bot: Temporary user banner should not have such a high z-index [skins/MinervaNeue] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196760 (https://phabricator.wikimedia.org/T407549) (owner: Jdlrobson)
[21:31:12] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1196760|Temporary user banner should not have such a high z-index (T407549)]]
[21:31:16] T407549: Temp accounts banner overlay blocks content on mobile - https://phabricator.wikimedia.org/T407549
[21:31:33] kostajh: on debug soon if you want to get ready to test
[21:32:50] (Abandoned) BCornwall: cdn.roll-reboot: Run puppet as post_action [cookbooks] - https://gerrit.wikimedia.org/r/1196756 (owner: BCornwall)
[21:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport:
ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:35:36] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1196760|Temporary user banner should not have such a high z-index (T407549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:36:08] kostajh: testing now
[21:37:12] LGTM
[21:39:29] Hey folks - ok if I deploy a quick security patch? Web Development team?
[21:39:50] sbassett:
[21:39:54] almost done
[21:40:07] tx
[21:40:20] I am just waiting for kostajh or volker to test. If I don't hear from them in 5 min I'll proceed with the sync
[21:42:12] ok continuing
[21:42:16] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[21:46:33] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196760|Temporary user banner should not have such a high z-index (T407549)]] (duration: 15m 21s)
[21:46:37] T407549: Temp accounts banner overlay blocks content on mobile - https://phabricator.wikimedia.org/T407549
[21:53:29] sbassett: all yours
[21:54:06] tx
[21:58:04] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11283517 (BCornwall) Yes, good for me. I'm assuming you meant November 4 as per your othe...
[22:04:29] !log Deployed security fix for T407131
[22:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:14] sbassett: all through? I'll do some stuff if it's all clear
[22:14:03] rzl: yes, all yours.
[22:14:08] thanks!
[22:14:09] (PS1) Andrew Bogott: preseed: remove redundant (and incorrect) def for cloudcontrol2010-dev [puppet] - https://gerrit.wikimedia.org/r/1196772
[22:14:53] (PS1) Cwhite: logstash: initial dmarc-> ecs filter [puppet] - https://gerrit.wikimedia.org/r/1196773 (https://phabricator.wikimedia.org/T404888)
[22:16:36] (PS1) Zabe: PS.php: Add analytics-web service [mediawiki-config] - https://gerrit.wikimedia.org/r/1196774 (https://phabricator.wikimedia.org/T309738)
[22:16:52] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/apertium: apply
[22:16:53] (CR) Andrew Bogott: [C:+2] preseed: remove redundant (and incorrect) def for cloudcontrol2010-dev [puppet] - https://gerrit.wikimedia.org/r/1196772 (owner: Andrew Bogott)
[22:17:24] (CR) CI reject: [V:-1] PS.php: Add analytics-web service [mediawiki-config] - https://gerrit.wikimedia.org/r/1196774 (https://phabricator.wikimedia.org/T309738) (owner: Zabe)
[22:17:37] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[22:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:19:52] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[22:21:33] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1196775
[22:21:38] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1196776
[22:21:42] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1196777
[22:23:47] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[22:24:26] !log rzl@deploy1003 helmfile [codfw] DONE
helmfile.d/services/changeprop: apply
[22:24:34] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[22:25:18] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[22:28:52] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[22:29:22] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[22:29:56] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[22:30:02] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:31:04] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply
[22:31:22] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply
[22:31:52] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[22:32:25] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[22:32:47] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[22:33:07] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[22:33:38] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[22:33:55] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[22:34:23] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:34:39] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[22:35:37] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[22:36:01] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[22:36:17] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[22:36:35] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[22:36:49] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[22:37:10] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[22:37:42] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[22:37:43] (CR) Pppery: "These should probably go to the paid editing blog post." [puppet] - https://gerrit.wikimedia.org/r/1196776 (owner: Ncmonitor)
[22:37:57] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[22:38:23] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[22:38:42] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[22:39:20] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[22:39:39] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[22:40:17] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[22:40:52] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1196776 (owner: Ncmonitor)
[22:41:06] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[22:41:56] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[22:42:17] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[22:43:15] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[22:44:28] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[22:44:50] !log rzl@deploy1003 helmfile [codfw] DONE
helmfile.d/services/geo-analytics: apply
[22:46:00] SRE, cloud-services-team: latested Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586 (Andrew) NEW
[22:46:06] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[22:46:26] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[22:47:11] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply
[22:47:35] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[22:48:27] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[22:49:34] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[22:49:48] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[22:55:56] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[22:57:45] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mathoid: apply
[22:58:15] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[22:58:47] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[22:59:01] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[22:59:24] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[22:59:37] (CR) Brennen Bearnes: [C:+1] "Confirming that I tested this change using puppetmaster-1003 and phabricator-bullseye in devtools. Nothing seems to break."
[puppet] - https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: Dzahn)
[23:02:10] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[23:03:25] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[23:03:41] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[23:03:54] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[23:05:07] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[23:05:50] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/push-notifications: apply
[23:06:30] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply
[23:06:40] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[23:07:05] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[23:07:53] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply
[23:08:11] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
[23:09:14] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply
[23:09:49] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[23:10:07] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[23:10:23] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[23:11:08] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[23:11:24] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[23:11:38] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[23:12:06] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[23:13:17] !log rzl@deploy1003 helmfile [codfw] START
helmfile.d/services/shellbox-timeline: apply
[23:13:42] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[23:15:10] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[23:15:46] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[23:16:49] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[23:17:21] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[23:17:30] (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1196776 (owner: Ncmonitor)
[23:17:54] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/termbox: apply
[23:18:40] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[23:18:55] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply
[23:19:33] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[23:20:00] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[23:20:18] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[23:20:33] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply
[23:20:56] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[23:27:46] (done)
[23:31:11] nice :')
[23:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1196778
[23:38:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1196778 (owner: TrainBranchBot)
[23:38:43] andrew@cumin2002 reimage (PID 2391737) is awaiting input
[23:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:51:52] /ac
[23:53:26] SRE-swift-storage: File missing from four datacenters - https://phabricator.wikimedia.org/T407589 (jlwoodwa) NEW
[23:54:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1196778 (owner: TrainBranchBot)