[00:00:47] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [00:00:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with... [00:02:27] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11033384 (10KFrancis) The NDA has been sent for signatures. I'll confirm when it's complete, Thanks! [00:02:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [00:02:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [00:08:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172438 [00:08:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172438 (owner: 10TrainBranchBot) [00:09:17] (03CR) 10Stang: "To deployer: please create securepoll_log table before merge this patch. 
Ref: T396483" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [00:21:25] 06SRE, 06serviceops, 10Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033423 (10Scott_French) One additional data point: To find out exactly how slow these requests are (if indeed they succeed) I `nsenter`'d the netns of a wikifeeds po... [00:29:09] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172438 (owner: 10TrainBranchBot) [00:39:18] 06SRE, 06serviceops, 10Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033433 (10Scott_French) [00:56:27] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:58:33] 06SRE, 06serviceops, 10Wikifeeds: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033437 (10Scott_French) [01:28:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [02:23:21] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 156811 MB (4% inode=99%): /var/lib/hadoop/data/g 154199 MB (4% inode=99%): /var/lib/hadoop/data/j 155902 MB (4% inode=99%): /var/lib/hadoop/data/c 148795 MB (3% inode=99%): /var/lib/hadoop/data/b 153122 MB (4% inode=99%): /var/lib/hadoop/data/l 156182 MB (4% inode=99%): /var/lib/hadoop/data/k 155868 MB (4% inode=99%): /var/lib/hadoop/data [02:23:21] 5 MB (4% 
inode=99%): /var/lib/hadoop/data/i 153339 MB (4% inode=99%): /var/lib/hadoop/data/m 153563 MB (4% inode=99%): /var/lib/hadoop/data/d 154316 MB (4% inode=99%): /var/lib/hadoop/data/h 156677 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [02:34:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:50] 06SRE, 06serviceops, 10Wikifeeds, 07Chinese-Sites: 504 responses (gateway timeout) for /api/rest_v1/feed/featured - https://phabricator.wikimedia.org/T400425#11033508 (10Shizhao) [04:02:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [04:02:59] FIRING: KubernetesDeploymentUnavailableReplicas: 
... [04:02:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [04:34:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:34:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:34:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:33] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:38:23] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:39:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:39:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:43:21] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 153578 MB (4% inode=99%): /var/lib/hadoop/data/g 150611 MB (4% inode=99%): /var/lib/hadoop/data/j 153266 MB (4% inode=99%): /var/lib/hadoop/data/c 147545 MB (3% inode=99%): /var/lib/hadoop/data/b 151971 MB (4% inode=99%): /var/lib/hadoop/data/l 155067 MB (4% inode=99%): /var/lib/hadoop/data/k 150252 MB (4% inode=99%): /var/lib/hadoop/data [05:43:21] 9 MB (4% inode=99%): /var/lib/hadoop/data/i 150558 MB (4% inode=99%): /var/lib/hadoop/data/m 154663 MB (4% inode=99%): /var/lib/hadoop/data/d 153217 MB (4% inode=99%): /var/lib/hadoop/data/h 152638 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [05:46:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1236.eqiad.wmnet with reason: 
Maintenance [05:47:11] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1172511 (https://phabricator.wikimedia.org/T400435) [05:50:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es7 T400435 [05:50:37] T400435: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T400435 [05:51:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2039 with weight 0 T400435', diff saved to https://phabricator.wikimedia.org/P79888 and previous config saved to /var/cache/conftool/dbconfig/20250725-055105-root.json [05:52:00] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1172511 (https://phabricator.wikimedia.org/T400435) (owner: 10Gerrit maintenance bot) [05:53:07] !log Starting es7 codfw failover from es2038 to es2039 - T400435 [05:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2039 to es7 primary T400435', diff saved to https://phabricator.wikimedia.org/P79889 and previous config saved to /var/cache/conftool/dbconfig/20250725-055342-root.json [05:54:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2038 T400435', diff saved to https://phabricator.wikimedia.org/P79890 and previous config saved to /var/cache/conftool/dbconfig/20250725-055449-root.json [05:56:20] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1172512 (https://phabricator.wikimedia.org/T400436) [05:57:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es6 T400436 [05:57:48] T400436: Switchover es6 master (es2037 -> es2035) - https://phabricator.wikimedia.org/T400436 [05:57:49] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Set es2035 with weight 0 T400436', diff saved to https://phabricator.wikimedia.org/P79891 and previous config saved to /var/cache/conftool/dbconfig/20250725-055749-root.json [05:58:25] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1172512 (https://phabricator.wikimedia.org/T400436) (owner: 10Gerrit maintenance bot) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T0600) [06:00:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2035 to es6 primary T400436', diff saved to https://phabricator.wikimedia.org/P79892 and previous config saved to /var/cache/conftool/dbconfig/20250725-060005-marostegui.json [06:01:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2037 T400436', diff saved to https://phabricator.wikimedia.org/P79893 and previous config saved to /var/cache/conftool/dbconfig/20250725-060103-root.json [06:02:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2037.codfw.wmnet with reason: Maintenance [06:02:46] !log Starting es6 codfw failover from es2037 to es2035 - T400436 [06:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:50] T400436: Switchover es6 master (es2037 -> es2035) - https://phabricator.wikimedia.org/T400436 [06:05:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11033608 (10Marostegui) @Jhancock.wm es2037 is ready for you - homer was run. 
[06:12:05] (03PS1) 10Marostegui: db2222: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172513 (https://phabricator.wikimedia.org/T399955) [06:13:25] (03CR) 10Marostegui: [C:03+2] db2222: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172513 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [06:13:43] (03PS21) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [06:14:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2222.codfw.wmnet with reason: Maintenance [06:14:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2222 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79894 and previous config saved to /var/cache/conftool/dbconfig/20250725-061426-marostegui.json [06:14:44] (03CR) 10Elukey: sre.hosts.provision: add custom settings for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [06:22:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79895 and previous config saved to /var/cache/conftool/dbconfig/20250725-062204-root.json [06:22:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [06:37:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79896 and previous config saved to /var/cache/conftool/dbconfig/20250725-063710-root.json [06:38:37] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not 
working - https://phabricator.wikimedia.org/T396718#11033651 (10ayounsi) 05Open→03Resolved All good, thanks a lot! [06:52:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79897 and previous config saved to /var/cache/conftool/dbconfig/20250725-065215-root.json [06:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T0700) [07:07:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2222 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79898 and previous config saved to /var/cache/conftool/dbconfig/20250725-070721-root.json [07:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:39] (03PS1) 10Kevin Bazira: ml-services: update RRML image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172527 (https://phabricator.wikimedia.org/T399437) [07:45:44] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, and yes on the AM API side access is controlled by 'profile::alertmanager::api::ro' and 'profile::alertmanager::api::rw'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172334 
(https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [07:47:45] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: Enable egress for Alertmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172334 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [07:55:43] (03PS60) 10Arnaudb: gerrit: Bugfixes - dry run tests [cookbooks] - 10https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) [07:55:43] (03CR) 10Arnaudb: "This change will bring the various fixes resulting from previous tests." [cookbooks] - 10https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:02:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [08:02:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [08:10:26] Hmm, hnowlan you taking a look or should I? 
^ [08:18:30] (03PS1) 10Marostegui: db2221: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172588 (https://phabricator.wikimedia.org/T399955) [08:21:10] (03CR) 10Marostegui: [C:03+2] db2221: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172588 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [08:24:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2221.codfw.wmnet with reason: Maintenance [08:24:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2221 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79899 and previous config saved to /var/cache/conftool/dbconfig/20250725-082430-marostegui.json [08:29:26] (03CR) 10Stevemunene: [C:03+2] dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [08:29:35] !log stevemunene@dns1004 START - running authdns-update [08:30:44] !log stevemunene@dns1004 END - running authdns-update [08:31:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79900 and previous config saved to /var/cache/conftool/dbconfig/20250725-083158-root.json [08:41:21] thumbor should recover, just a couple pods wedged on tiff conversions once again [08:42:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: 
[08:42:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [08:47:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79901 and previous config saved to /var/cache/conftool/dbconfig/20250725-084703-root.json [09:00:05] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM ❤️" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172527 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [09:02:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79902 and previous config saved to /var/cache/conftool/dbconfig/20250725-090209-root.json [09:06:34] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11033933 (10MatthewVernon) 05Open→03Stalled [09:09:37] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Undeleted file is an incorrect version - https://phabricator.wikimedia.org/T399892#11033951 (10MatthewVernon) @hinnk are you happy to upload the version @jcrespo regenerated, please? 
[09:11:12] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172527 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [09:13:04] (03Merged) 10jenkins-bot: ml-services: update RRML image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172527 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [09:15:01] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:17:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79903 and previous config saved to /var/cache/conftool/dbconfig/20250725-091715-root.json [09:25:21] (03CR) 10Btullis: "Both of these changes look good by themselves, but I don't think that you need to add the hieradata/common/kubernetes.yaml file in order t" [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [09:39:35] (03PS1) 10Clément Goubert: api-gateway: Conditional restbase compatibility headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172599 (https://phabricator.wikimedia.org/T400346) [09:45:51] (03PS2) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [09:54:32] (03CR) 10FNegri: [C:03+1] wmcs: Update URL in comment in maintain_dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis) [10:00:03] (03PS3) 10Stevemunene: dse-k8s: deploy etcd service [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) [10:03:22] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 157317 MB (4% inode=99%): /var/lib/hadoop/data/g 153386 MB (4% inode=99%): 
/var/lib/hadoop/data/j 152396 MB (4% inode=99%): /var/lib/hadoop/data/c 149715 MB (3% inode=99%): /var/lib/hadoop/data/b 153699 MB (4% inode=99%): /var/lib/hadoop/data/l 159599 MB (4% inode=99%): /var/lib/hadoop/data/k 155068 MB (4% inode=99%): /var/lib/hadoop/data [10:03:22] 9 MB (4% inode=99%): /var/lib/hadoop/data/i 153737 MB (4% inode=99%): /var/lib/hadoop/data/m 157442 MB (4% inode=99%): /var/lib/hadoop/data/d 153546 MB (4% inode=99%): /var/lib/hadoop/data/h 151924 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [10:10:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [10:10:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T399728)', diff saved to https://phabricator.wikimedia.org/P79904 and previous config saved to /var/cache/conftool/dbconfig/20250725-101017-fceratto.json [10:10:22] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:15:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T399728)', diff saved to https://phabricator.wikimedia.org/P79905 and previous config saved to /var/cache/conftool/dbconfig/20250725-101536-fceratto.json [10:15:41] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:18:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11034195 (10SD0001) NDA previously signed in T374998. 
[10:22:20] (03PS2) 10Clément Goubert: api-gateway: Conditional restbase compatibility headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172599 (https://phabricator.wikimedia.org/T400346) [10:30:39] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 8657 [10:30:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P79906 and previous config saved to /var/cache/conftool/dbconfig/20250725-103043-fceratto.json [10:31:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8657 [10:43:22] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 155101 MB (4% inode=99%): /var/lib/hadoop/data/g 151115 MB (4% inode=99%): /var/lib/hadoop/data/j 152879 MB (4% inode=99%): /var/lib/hadoop/data/c 149931 MB (3% inode=99%): /var/lib/hadoop/data/b 150314 MB (4% inode=99%): /var/lib/hadoop/data/l 157193 MB (4% inode=99%): /var/lib/hadoop/data/k 154691 MB (4% inode=99%): /var/lib/hadoop/data [10:43:22] 3 MB (4% inode=99%): /var/lib/hadoop/data/i 151393 MB (4% inode=99%): /var/lib/hadoop/data/m 157679 MB (4% inode=99%): /var/lib/hadoop/data/d 152046 MB (4% inode=99%): /var/lib/hadoop/data/h 149464 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [10:45:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P79907 and previous config saved to /var/cache/conftool/dbconfig/20250725-104551-fceratto.json [10:49:39] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [10:51:33] (03PS2) 10Btullis: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [10:52:05] (03CR) 10CI reject: [V:04-1] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [10:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T0700) [11:00:04] jelto, arnoldokoth, and mutante: #bothumor I � Unicode. All rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250725T1100). 
[11:00:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T399728)', diff saved to https://phabricator.wikimedia.org/P79908 and previous config saved to /var/cache/conftool/dbconfig/20250725-110058-fceratto.json [11:01:04] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:01:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:01:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T399728)', diff saved to https://phabricator.wikimedia.org/P79909 and previous config saved to /var/cache/conftool/dbconfig/20250725-110121-fceratto.json [11:02:45] (03PS1) 10Hnowlan: thumbor: be more specific in alert message [alerts] - 10https://gerrit.wikimedia.org/r/1172615 [11:05:52] (03CR) 10Clément Goubert: [C:03+1] thumbor: be more specific in alert message [alerts] - 10https://gerrit.wikimedia.org/r/1172615 (owner: 10Hnowlan) [11:06:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399728)', diff saved to https://phabricator.wikimedia.org/P79910 and previous config saved to /var/cache/conftool/dbconfig/20250725-110623-fceratto.json [11:06:28] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:00] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [11:10:17] (03CR) 10Effie Mouzeli: [C:03+1] 
thumbor: be more specific in alert message [alerts] - 10https://gerrit.wikimedia.org/r/1172615 (owner: 10Hnowlan) [11:10:55] (03CR) 10Hnowlan: [C:03+2] thumbor: be more specific in alert message [alerts] - 10https://gerrit.wikimedia.org/r/1172615 (owner: 10Hnowlan) [11:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:12:06] (03Merged) 10jenkins-bot: thumbor: be more specific in alert message [alerts] - 10https://gerrit.wikimedia.org/r/1172615 (owner: 10Hnowlan) [11:12:07] (03PS3) 10Btullis: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:12:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [11:13:02] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6417/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:16:09] (03PS4) 10Btullis: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:16:34] (03CR) 10CI reject: [V:04-1] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:16:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6418/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:18:44] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11034336 (10Fabfur) The image used for debci building (bullseye) is still affected by the bullseye-backports issue: ` ~ > podman run -it --rm docker-... [11:21:00] (03PS5) 10Btullis: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:21:20] (03CR) 10Stevemunene: [C:03+2] dse-k8s: deploy etcd service [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [11:21:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P79911 and previous config saved to /var/cache/conftool/dbconfig/20250725-112130-fceratto.json [11:21:48] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6419/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:33:52] (03PS1) 10Stevemunene: Add dse-k8s-codfw site [puppet] - 10https://gerrit.wikimedia.org/r/1172617 (https://phabricator.wikimedia.org/T397293) [11:35:31] (03PS2) 10Stevemunene: Add dse-k8s-codfw site definition [puppet] - 10https://gerrit.wikimedia.org/r/1172617 (https://phabricator.wikimedia.org/T397293) [11:35:46] (03CR) 10Btullis: [C:03+1] Add dse-k8s-codfw site definition [puppet] - 10https://gerrit.wikimedia.org/r/1172617 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [11:36:31] 
(03CR) 10Stevemunene: [C:03+2] Add dse-k8s-codfw site definition [puppet] - 10https://gerrit.wikimedia.org/r/1172617 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [11:36:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P79912 and previous config saved to /var/cache/conftool/dbconfig/20250725-113638-fceratto.json [11:38:09] (03CR) 10Btullis: [V:03+1 C:03+2] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [11:39:41] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [11:42:05] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [11:51:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399728)', diff saved to https://phabricator.wikimedia.org/P79913 and previous config saved to /var/cache/conftool/dbconfig/20250725-115145-fceratto.json [11:51:51] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:52:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [11:52:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T399728)', diff saved to https://phabricator.wikimedia.org/P79914 and previous config saved to /var/cache/conftool/dbconfig/20250725-115208-fceratto.json [11:52:59] (03PS1) 10Stevemunene: Add dse-k8s-codfw etcd configuration [puppet] - 10https://gerrit.wikimedia.org/r/1172619 (https://phabricator.wikimedia.org/T397293) 
[11:57:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399728)', diff saved to https://phabricator.wikimedia.org/P79915 and previous config saved to /var/cache/conftool/dbconfig/20250725-115712-fceratto.json [11:57:18] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:12:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P79916 and previous config saved to /var/cache/conftool/dbconfig/20250725-121219-fceratto.json [12:25:54] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (LIST configmaps) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P79917 and previous config saved to /var/cache/conftool/dbconfig/20250725-122727-fceratto.json [12:29:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:54] RESOLVED: [4x] KubernetesAPILatency: High Kubernetes API latency (LIST configmaps) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:37:58] FIRING: 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:40:00] (03PS1) 10Gkyziridis: ml-services: Deploy revertrisk-language-agnostic latest published image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172622 (https://phabricator.wikimedia.org/T400266) [12:40:24] (03PS2) 10Gkyziridis: ml-services: Deploy revertrisk-language-agnostic latest published image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172622 (https://phabricator.wikimedia.org/T400266) [12:41:31] (03CR) 10Xcollazo: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [12:42:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399728)', diff saved to https://phabricator.wikimedia.org/P79918 and previous config saved to /var/cache/conftool/dbconfig/20250725-124234-fceratto.json [12:42:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:42:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2199.codfw.wmnet with reason: Maintenance [12:44:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:02] (03CR) 10Btullis: "The commit message doesn't match the content of the 
patch." [puppet] - 10https://gerrit.wikimedia.org/r/1172619 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [12:45:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [12:45:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T399728)', diff saved to https://phabricator.wikimedia.org/P79919 and previous config saved to /var/cache/conftool/dbconfig/20250725-124550-fceratto.json [12:50:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T399728)', diff saved to https://phabricator.wikimedia.org/P79920 and previous config saved to /var/cache/conftool/dbconfig/20250725-125058-fceratto.json [12:51:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:51:52] (03PS61) 10Arnaudb: gerrit: Bugfixes - dry run tests [cookbooks] - 10https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) [12:52:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:59:35] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11034559 (10Jclark-ctr) All devices in Row D are currently connected with single power until the order of longer power cables arrives. Yesterday and today, I ran two new console cables to the... 
[13:00:29] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11034572 (10Jclark-ctr) a:05Jclark-ctr→03BTullis @BTullis please assign back to me when i am able to replace drive [13:02:47] (03PS2) 10Stevemunene: dse-k8s: Add dse-k8s-codfw k8s configuration [puppet] - 10https://gerrit.wikimedia.org/r/1172619 (https://phabricator.wikimedia.org/T397293) [13:05:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P79921 and previous config saved to /var/cache/conftool/dbconfig/20250725-130606-fceratto.json [13:09:36] (03CR) 10Btullis: [C:03+1] dse-k8s: Add dse-k8s-codfw k8s configuration [puppet] - 10https://gerrit.wikimedia.org/r/1172619 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [13:15:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P79922 and previous config saved to /var/cache/conftool/dbconfig/20250725-132113-fceratto.json [13:21:50] (03PS1) 10Marostegui: tables-catalog.yaml: Mark discussiontools_items as private [puppet] - 10https://gerrit.wikimedia.org/r/1172628 (https://phabricator.wikimedia.org/T400420) [13:22:27] FIRING: 
[2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:40] (03CR) 10Majavah: [C:03+1] tables-catalog.yaml: Mark discussiontools_items as private [puppet] - 10https://gerrit.wikimedia.org/r/1172628 (https://phabricator.wikimedia.org/T400420) (owner: 10Marostegui) [13:24:16] (03CR) 10Marostegui: [C:03+2] tables-catalog.yaml: Mark discussiontools_items as private [puppet] - 10https://gerrit.wikimedia.org/r/1172628 (https://phabricator.wikimedia.org/T400420) (owner: 10Marostegui) [13:26:21] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11034605 (10herron) Seeing some success with the prometheus compactor and sidecar workaround. I've been able to upload backfilled blocks to Thanos in a way that at least partially works. F... [13:28:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:30:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:31:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:32:10] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:32:16] (03PS1) 10Clément Goubert: python-build: Bump bullseye changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172631 [13:32:22] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:32:36] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:32:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:33:46] (03CR) 10Ayounsi: [C:03+1] python-build: Bump bullseye changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172631 (owner: 10Clément Goubert) [13:34:03] (03CR) 10Clément Goubert: [V:03+2 C:03+2] python-build: Bump bullseye changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172631 (owner: 10Clément Goubert) [13:34:22] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:35:02] (03CR) 10MVernon: "I don't know how picky our machinery is, but I think changelog entries have to start with a * to be valid..." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172631 (owner: 10Clément Goubert) [13:36:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T399728)', diff saved to https://phabricator.wikimedia.org/P79924 and previous config saved to /var/cache/conftool/dbconfig/20250725-133621-fceratto.json [13:36:26] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:36:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [13:36:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T399728)', diff saved to https://phabricator.wikimedia.org/P79925 and previous config saved to /var/cache/conftool/dbconfig/20250725-133644-fceratto.json [13:40:38] (03PS1) 10Clément Goubert: python3-build-bullseye: fix changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172636 [13:41:32] (03PS1) 10Federico Ceratto: zarcillo: Add egress to Netbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) [13:41:32] (03CR) 10Federico Ceratto: "Add egress to Netbox to Zarcillo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [13:41:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T399728)', diff saved to https://phabricator.wikimedia.org/P79926 and previous config saved to /var/cache/conftool/dbconfig/20250725-134145-fceratto.json [13:41:51] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:47:42] (03PS1) 10Majavah: wikireplicas: maintain-views: Allow running without replacing views [puppet] - 
10https://gerrit.wikimedia.org/r/1172638 [13:53:53] (03CR) 10FNegri: [C:03+1] "this is MUCH more useful than the Y/N prompt that I never ever used." [puppet] - 10https://gerrit.wikimedia.org/r/1172638 (owner: 10Majavah) [13:55:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:56:47] (03CR) 10Clément Goubert: [C:04-1] "Addresses are not right." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [13:56:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P79928 and previous config saved to /var/cache/conftool/dbconfig/20250725-135653-fceratto.json [14:00:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-4gnv2 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [14:02:31] (03PS1) 10Marostegui: tables-catalog.yaml: Make discussion tables partially public [puppet] - 10https://gerrit.wikimedia.org/r/1172642 (https://phabricator.wikimedia.org/T400420) [14:03:21] (03CR) 10Majavah: [C:03+1] tables-catalog.yaml: Make discussion tables partially public [puppet] - 10https://gerrit.wikimedia.org/r/1172642 (https://phabricator.wikimedia.org/T400420) (owner: 10Marostegui) [14:05:05] (03CR) 10Marostegui: [C:03+2] tables-catalog.yaml: Make discussion tables partially public [puppet] - 10https://gerrit.wikimedia.org/r/1172642 (https://phabricator.wikimedia.org/T400420) (owner: 10Marostegui) [14:07:19] (03CR) 10Majavah: 
[C:03+2] wikireplicas: maintain-views: Allow running without replacing views [puppet] - 10https://gerrit.wikimedia.org/r/1172638 (owner: 10Majavah) [14:09:12] (03PS1) 10Ayounsi: Release v0.10.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1172644 [14:09:47] (03CR) 10Cathal Mooney: [C:03+1] Release v0.10.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1172644 (owner: 10Ayounsi) [14:10:30] (03CR) 10Ayounsi: [V:03+2 C:03+2] Release v0.10.2 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1172644 (owner: 10Ayounsi) [14:12:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P79929 and previous config saved to /var/cache/conftool/dbconfig/20250725-141201-fceratto.json [14:12:37] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11034774 (10Andrew) I broke the cluster again, but now it's working. The main thing I did was a version... 
[14:13:57] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:43] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [14:20:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.2 - cmooney@cumin1003 [14:22:28] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11034824 (10Scott_French) Thanks, @Fabfur - If this is urgent, a manual rebuild of the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dock... 
[14:27:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T399728)', diff saved to https://phabricator.wikimedia.org/P79930 and previous config saved to /var/cache/conftool/dbconfig/20250725-142708-fceratto.json [14:27:14] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:27:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [14:27:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T399728)', diff saved to https://phabricator.wikimedia.org/P79931 and previous config saved to /var/cache/conftool/dbconfig/20250725-142730-fceratto.json [14:31:44] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11034843 (10Fabfur) Thanks a lot @Scott_French ! This weekend is fine, I'll retry on Monday! 
[14:32:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T399728)', diff saved to https://phabricator.wikimedia.org/P79932 and previous config saved to /var/cache/conftool/dbconfig/20250725-143238-fceratto.json [14:32:45] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:35:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-4gnv2 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [14:43:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance [14:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T399249)', diff saved to https://phabricator.wikimedia.org/P79933 and previous config saved to /var/cache/conftool/dbconfig/20250725-144329-marostegui.json [14:43:35] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:43:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T399249)', diff saved to https://phabricator.wikimedia.org/P79934 and previous config saved to /var/cache/conftool/dbconfig/20250725-144344-marostegui.json [14:44:08] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11034873 (10herron) T400071#11034605 steps through a backfill process with an ad-hoc prometheus (and ad-hoc sidecar) that worked to upload backf... 
[14:47:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P79936 and previous config saved to /var/cache/conftool/dbconfig/20250725-144746-fceratto.json [14:47:52] (03PS1) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) [14:48:17] (03CR) 10CI reject: [V:04-1] neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [14:48:53] (03PS2) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) [14:49:18] (03CR) 10CI reject: [V:04-1] neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [14:50:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:03] (03PS3) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) [14:51:13] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [14:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:33] (03PS4) 10Andrew Bogott: 
neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) [14:54:37] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [14:57:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:58:16] (03PS1) 10Majavah: wikireplicas: Use --replace instead of --replace-all [cookbooks] - 10https://gerrit.wikimedia.org/r/1172657 [14:58:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P79937 and previous config saved to /var/cache/conftool/dbconfig/20250725-145851-marostegui.json [15:01:57] (03CR) 10FNegri: [C:03+1] wikireplicas: Use --replace instead of --replace-all [cookbooks] - 10https://gerrit.wikimedia.org/r/1172657 (owner: 10Majavah) [15:02:19] (03PS5) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) [15:02:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P79938 and previous config saved to /var/cache/conftool/dbconfig/20250725-150253-fceratto.json [15:06:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [15:06:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bmfgq - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor 
- https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [15:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11034978 (10Jhancock.wm) I found a console cable we can use but I'm not sure it's going to work. I connected it to the serial port on the back of the serv... [15:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:21] (03PS2) 10Federico Ceratto: zarcillo: Add egress to dyna.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) [15:13:16] (03CR) 10Ssingh: [C:03+1] "Looks good, thank you for updating it. Will roll this out on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis) [15:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P79939 and previous config saved to /var/cache/conftool/dbconfig/20250725-151359-marostegui.json [15:15:42] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485 (10RobH) 03NEW [15:16:03] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11035008 (10RobH) [15:16:51] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11035012 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) a... [15:17:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:18:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T399728)', diff saved to https://phabricator.wikimedia.org/P79940 and previous config saved to /var/cache/conftool/dbconfig/20250725-151801-fceratto.json [15:18:06] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:18:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2236.codfw.wmnet with reason: Maintenance [15:18:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T399728)', diff saved to https://phabricator.wikimedia.org/P79941 and previous config saved to 
/var/cache/conftool/dbconfig/20250725-151823-fceratto.json [15:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bmfgq - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [15:22:48] (03PS1) 10Clément Goubert: deploy2003: Add to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1172660 (https://phabricator.wikimedia.org/T400485) [15:23:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T399728)', diff saved to https://phabricator.wikimedia.org/P79942 and previous config saved to /var/cache/conftool/dbconfig/20250725-152318-fceratto.json [15:23:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:24:35] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [15:25:15] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:25:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:33] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:26:43] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2037.codfw.wmnet with OS bookworm [15:27:01] !log cwhite@cumin2002 START - Cookbook 
sre.hosts.move-vlan for host logstash2037 [15:29:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T399249)', diff saved to https://phabricator.wikimedia.org/P79943 and previous config saved to /var/cache/conftool/dbconfig/20250725-152906-marostegui.json [15:29:12] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:29:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance [15:29:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T399249)', diff saved to https://phabricator.wikimedia.org/P79944 and previous config saved to /var/cache/conftool/dbconfig/20250725-152930-marostegui.json [15:30:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:31:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T399249)', diff saved to https://phabricator.wikimedia.org/P79945 and previous config saved to /var/cache/conftool/dbconfig/20250725-153145-marostegui.json [15:31:48] RESOLVED: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bmfgq - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [15:31:53] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11035093 (10Aklapper) > I don't think there's anything more I can do here, I'm afraid. @MatthewVernon: Who could, potentially? 
(Asking because if nobody can do anything else this ticke... [15:31:53] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [15:32:14] jhancock@cumin1003 netbox (PID 3372390) is awaiting input [15:33:01] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating es2037 to codfw - jhancock@cumin1003" [15:33:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating es2037 to codfw - jhancock@cumin1003" [15:33:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:28] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2037 [15:35:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2037 [15:37:08] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11035127 (10Aklapper) @ttaylor: Only slightly related: Feel also encouraged to link [your LDAP account](https://ldap.toolforge.org/user/TTaylor) to [your... 
[15:37:25] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2037 - cwhite@cumin2002" [15:37:31] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2037 - cwhite@cumin2002" [15:37:31] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:37:31] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2037.codfw.wmnet 130.32.192.10.in-addr.arpa 0.3.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:37:34] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2037.codfw.wmnet 130.32.192.10.in-addr.arpa 0.3.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:37:35] !log cwhite@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2037 [15:38:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P79946 and previous config saved to /var/cache/conftool/dbconfig/20250725-153825-fceratto.json [15:38:43] !log cwhite@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2037 [15:38:44] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host logstash2037 [15:39:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11035132 (10Jhancock.wm) @Marostegui 2037 is moved and updated. All yours! We can schedule 2038 for Monday or Tuesday if you want. 
[15:39:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:40:14] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11035133 (10Marostegui) Thanks, I will get es2038 ready by Monday [15:40:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11035134 (10Marostegui) [15:41:14] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:41:32] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:42:54] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:58] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:26] (03CR) 10Scott French: [C:03+1] deploy2003: Add to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1172660 (https://phabricator.wikimedia.org/T400485) (owner: 10Clément Goubert) [15:46:30] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P79947 and previous config saved to /var/cache/conftool/dbconfig/20250725-154652-marostegui.json 
[15:48:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:48:36] (03CR) 10Jasmine: [C:03+1] deploy2003: Add to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1172660 (https://phabricator.wikimedia.org/T400485) (owner: 10Clément Goubert) [15:50:09] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11035174 (10MatthewVernon) Someone might have a copy of the original available (but it's not in our storage or backups thereof). [15:53:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P79948 and previous config saved to /var/cache/conftool/dbconfig/20250725-155333-fceratto.json [15:56:35] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage [16:02:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P79949 and previous config saved to /var/cache/conftool/dbconfig/20250725-160200-marostegui.json [16:03:29] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage [16:07:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T399728)', diff saved to https://phabricator.wikimedia.org/P79950 and previous config saved to /var/cache/conftool/dbconfig/20250725-160840-fceratto.json [16:08:47] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 
'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:08:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2237.codfw.wmnet with reason: Maintenance [16:09:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T399728)', diff saved to https://phabricator.wikimedia.org/P79951 and previous config saved to /var/cache/conftool/dbconfig/20250725-160904-fceratto.json [16:09:33] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [16:09:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:10:17] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [16:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T399728)', diff saved to https://phabricator.wikimedia.org/P79952 and previous config saved to /var/cache/conftool/dbconfig/20250725-161402-fceratto.json [16:14:08] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:15:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:16:04] (03CR) 
10Btullis: [C:03+1] Blunderbuss helm chart that works with the new Blunderbuss versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171732 (https://phabricator.wikimedia.org/T392244) (owner: 10Aleksandar Mastilovic) [16:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T399249)', diff saved to https://phabricator.wikimedia.org/P79953 and previous config saved to /var/cache/conftool/dbconfig/20250725-161707-marostegui.json [16:17:10] !log dancy@deploy1003 Installing scap version "4.191.0" for 2 host(s) [16:17:14] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:17:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance [16:17:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79954 and previous config saved to /var/cache/conftool/dbconfig/20250725-161730-marostegui.json [16:18:10] !log dancy@deploy1003 Installation of scap version "4.191.0" completed for 2 hosts [16:18:17] jhancock@cumin1003 provision (PID 3375951) is awaiting input [16:19:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79955 and previous config saved to /var/cache/conftool/dbconfig/20250725-161946-marostegui.json [16:23:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:25:35] (03CR) 10FNegri: [C:03+1] neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 
(https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [16:26:13] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:26:52] (03CR) 10Andrew Bogott: [C:03+2] neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1172656 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [16:26:56] (03CR) 10Bking: [C:03+2] dse-k8s: Add dse-k8s-codfw k8s configuration [puppet] - 10https://gerrit.wikimedia.org/r/1172619 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [16:29:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P79956 and previous config saved to /var/cache/conftool/dbconfig/20250725-162908-fceratto.json [16:29:21] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2037.codfw.wmnet with OS bookworm [16:32:36] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11035340 (10Aklapper) 05Stalled→03Declined Thanks. Declined. [16:32:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11035342 (10RobH) >>! In T400211#11035275, @jhathaway wrote: > Supermicro indicated that the debug output from the supplied BIOS is only outputted to COM2... 
[16:33:58] (03PS1) 10Bking: Revert "dse-k8s: Add dse-k8s-codfw k8s configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1172676 [16:34:05] (03CR) 10Bking: [V:03+2 C:03+2] Revert "dse-k8s: Add dse-k8s-codfw k8s configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1172676 (owner: 10Bking) [16:34:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P79957 and previous config saved to /var/cache/conftool/dbconfig/20250725-163454-marostegui.json [16:35:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253#11035362 (10Jhancock.wm) i cleaned the fiber port and optic. last error was at 16:20 UTC. ish. Gonna let it sit and see if it goes off again. I'll... [16:39:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11035387 (10RobH) [16:41:19] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11035388 (10cmooney) JTAC 2025-0725-789575 opened [16:44:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P79958 and previous config saved to /var/cache/conftool/dbconfig/20250725-164416-fceratto.json [16:44:30] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:50:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P79959 and previous config saved to /var/cache/conftool/dbconfig/20250725-165002-marostegui.json [16:59:00] !log dancy@deploy1003 Started scap build-images: Publishing wmf/next image 
[16:59:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T399728)', diff saved to https://phabricator.wikimedia.org/P79960 and previous config saved to /var/cache/conftool/dbconfig/20250725-165923-fceratto.json [16:59:29] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:59:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [17:02:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2240.codfw.wmnet with reason: Maintenance [17:02:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T399728)', diff saved to https://phabricator.wikimedia.org/P79961 and previous config saved to /var/cache/conftool/dbconfig/20250725-170254-fceratto.json [17:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79962 and previous config saved to /var/cache/conftool/dbconfig/20250725-170509-marostegui.json [17:05:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:05:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:05:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T399249)', diff saved to https://phabricator.wikimedia.org/P79963 and previous config saved to /var/cache/conftool/dbconfig/20250725-170532-marostegui.json [17:07:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T399249)', diff saved to https://phabricator.wikimedia.org/P79964 and previous config saved to /var/cache/conftool/dbconfig/20250725-170748-marostegui.json [17:07:50] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2240 (T399728)', diff saved to https://phabricator.wikimedia.org/P79965 and previous config saved to /var/cache/conftool/dbconfig/20250725-170749-fceratto.json [17:07:58] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:09:53] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bullseye [17:10:08] !log dancy@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 07s) [17:10:09] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops, 13Patch-For-Review: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#11035461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host sretest2005.codfw.w... [17:12:15] Happy Friday! Requesting permission to deploy a Content Translation UBN [17:12:20] (03PS1) 10Ahmon Dancy: deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) [17:12:38] stephanebisson: Permission granted [17:12:44] (03CR) 10CI reject: [V:04-1] deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [17:13:08] dancy Thanks. 
Will be ready to do so in about 30 minutes [17:13:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:14:48] (03PS2) 10Ahmon Dancy: deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) [17:15:37] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [17:15:46] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11035470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm [17:21:25] (03CR) 10JHathaway: sre.hosts.provision: add custom settings for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [17:22:06] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [17:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P79966 and previous config saved to /var/cache/conftool/dbconfig/20250725-172255-marostegui.json [17:26:44] (03CR) 10BryanDavis: deployment_server: Add pretrain systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [17:26:44] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bullseye [17:26:57] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops, 13Patch-For-Review: Decom cookbook: delete virtual interfaces from device - 
https://phabricator.wikimedia.org/T398412#11035493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host sretest2005.codfw.wmnet... [17:27:05] (03PS1) 10Btullis: Use 'set -o pipefail' instead of 'set -e' in wikibase scripts [dumps] - 10https://gerrit.wikimedia.org/r/1172682 (https://phabricator.wikimedia.org/T400383) [17:29:01] (03CR) 10Btullis: [C:03+2] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts [dumps] - 10https://gerrit.wikimedia.org/r/1172682 (https://phabricator.wikimedia.org/T400383) (owner: 10Btullis) [17:30:16] (03CR) 10Ahmon Dancy: deployment_server: Add pretrain systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [17:32:10] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [17:32:27] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops, 13Patch-For-Review: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#11035506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host sretest2005.codfw.w... 
[17:36:11] (03CR) 10BryanDavis: [C:03+1] deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [17:38:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P79967 and previous config saved to /var/cache/conftool/dbconfig/20250725-173803-marostegui.json [17:38:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P79968 and previous config saved to /var/cache/conftool/dbconfig/20250725-173804-fceratto.json [17:41:26] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [17:42:47] (03CR) 10Cathal Mooney: zarcillo: Add egress to dyna.w.o (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [17:47:02] (03PS1) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 [17:47:54] (03CR) 10CI reject: [V:04-1] Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (owner: 10Bernard Wang) [17:53:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T399249)', diff saved to https://phabricator.wikimedia.org/P79969 and previous config saved to /var/cache/conftool/dbconfig/20250725-175310-marostegui.json [17:53:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:53:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance [17:53:33] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db2164 (T399249)', diff saved to https://phabricator.wikimedia.org/P79970 and previous config saved to /var/cache/conftool/dbconfig/20250725-175332-marostegui.json [17:55:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T399249)', diff saved to https://phabricator.wikimedia.org/P79971 and previous config saved to /var/cache/conftool/dbconfig/20250725-175548-marostegui.json [17:57:41] (03PS2) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) [17:58:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [17:58:30] (03CR) 10CI reject: [V:04-1] Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [17:58:36] (03PS1) 10Sbisson: Change how VE mobile toolbar is overridden [extensions/ContentTranslation] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172686 (https://phabricator.wikimedia.org/T400486) [17:59:20] (03PS1) 10Btullis: Restore 'set -o pipefail' behaviour for sub-shells [dumps] - 10https://gerrit.wikimedia.org/r/1172687 (https://phabricator.wikimedia.org/T400383) [17:59:52] I'm ready to deploy my CX UBN fix if it's still ok [17:59:55] (03CR) 10Btullis: [C:03+2] Restore 'set -o pipefail' behaviour for sub-shells [dumps] - 10https://gerrit.wikimedia.org/r/1172687 (https://phabricator.wikimedia.org/T400383) (owner: 10Btullis) [18:01:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" 
[extensions/ContentTranslation] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172686 (https://phabricator.wikimedia.org/T400486) (owner: 10Sbisson) [18:05:57] jhancock@cumin1003 reimage (PID 3381799) is awaiting input [18:10:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P79972 and previous config saved to /var/cache/conftool/dbconfig/20250725-181055-marostegui.json [18:13:29] (03Merged) 10jenkins-bot: Change how VE mobile toolbar is overridden [extensions/ContentTranslation] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172686 (https://phabricator.wikimedia.org/T400486) (owner: 10Sbisson) [18:13:47] 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11035636 (10jhathaway) 05Open→03Resolved a:03jhathaway >>! In T400288#11033037, @HCoplin-WMF wrote: > I was indeed able to log in and request it that way! Thank you :) > > Ap... [18:13:58] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1172686|Change how VE mobile toolbar is overridden (T400486)]] [18:14:03] T400486: Duplicated close and forward buttons in mobile translation editor - https://phabricator.wikimedia.org/T400486 [18:14:21] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11035641 (10jhathaway) 05Open→03Resolved a:03jhathaway [18:16:06] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1172686|Change how VE mobile toolbar is overridden (T400486)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:17:48] !log sbisson@deploy1003 sbisson: Continuing with sync [18:18:10] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11035665 (10jhathaway) [18:20:26] (03PS1) 10JHathaway: clinic-duty: add shell account for Tajh Taylor [puppet] - 10https://gerrit.wikimedia.org/r/1172693 (https://phabricator.wikimedia.org/T400277) [18:23:08] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172686|Change how VE mobile toolbar is overridden (T400486)]] (duration: 09m 09s) [18:23:13] T400486: Duplicated close and forward buttons in mobile translation editor - https://phabricator.wikimedia.org/T400486 [18:25:48] (03CR) 10RLazarus: [C:03+1] clinic-duty: add shell account for Tajh Taylor [puppet] - 10https://gerrit.wikimedia.org/r/1172693 (https://phabricator.wikimedia.org/T400277) (owner: 10JHathaway) [18:26:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P79973 and previous config saved to /var/cache/conftool/dbconfig/20250725-182603-marostegui.json [18:26:23] (03CR) 10JHathaway: [C:03+2] clinic-duty: add shell account for Tajh Taylor [puppet] - 10https://gerrit.wikimedia.org/r/1172693 (https://phabricator.wikimedia.org/T400277) (owner: 10JHathaway) [18:35:11] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [18:35:28] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops, 13Patch-For-Review: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#11035699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host sretest2005.codfw.wmnet... 
[18:38:51] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11035704 (10jhathaway) @ttaylor access should be setup, you should receive an email about setting up your kerberos credentials. Please try everything out... [18:41:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T399249)', diff saved to https://phabricator.wikimedia.org/P79974 and previous config saved to /var/cache/conftool/dbconfig/20250725-184110-marostegui.json [18:41:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:41:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance [18:41:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79975 and previous config saved to /var/cache/conftool/dbconfig/20250725-184133-marostegui.json [18:43:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79976 and previous config saved to /var/cache/conftool/dbconfig/20250725-184349-marostegui.json [18:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:58:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P79977 and previous config saved to /var/cache/conftool/dbconfig/20250725-185855-marostegui.json [19:04:08] (03CR) 10VolkerE: "The test failure seems unrelatedly connected to I8770dd6c639" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 
(https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [19:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:12:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11035837 (10VRiley-WMF) While attempting to image this server (clouddb1022) and got this error. {F65673966} [19:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P79978 and previous config saved to /var/cache/conftool/dbconfig/20250725-191403-marostegui.json [19:29:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T399249)', diff saved to https://phabricator.wikimedia.org/P79979 and previous config saved to /var/cache/conftool/dbconfig/20250725-192910-marostegui.json [19:29:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:29:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [19:29:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T399249)', diff saved to https://phabricator.wikimedia.org/P79980 and previous config saved to /var/cache/conftool/dbconfig/20250725-192933-marostegui.json [19:31:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T399249)', diff saved to https://phabricator.wikimedia.org/P79981 and previous config saved to /var/cache/conftool/dbconfig/20250725-193149-marostegui.json [19:33:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - 
https://phabricator.wikimedia.org/T400405#11035891 (10jhathaway) @SD0001 would you kindly post a gerrit patch with your ssh public key, as a way to verify it, outside of this ticket? [19:35:01] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11035893 (10jhathaway) @ahoelzl would you kindly approve @SD0001's access request? [19:46:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P79982 and previous config saved to /var/cache/conftool/dbconfig/20250725-194657-marostegui.json [20:02:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P79983 and previous config saved to /var/cache/conftool/dbconfig/20250725-200204-marostegui.json [20:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:12:16] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 39 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172711 (https://phabricator.wikimedia.org/T400510) [20:17:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T399249)', diff saved to https://phabricator.wikimedia.org/P79984 and previous config saved to /var/cache/conftool/dbconfig/20250725-201711-marostegui.json [20:17:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:17:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance [20:17:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T399249)', diff saved 
to https://phabricator.wikimedia.org/P79985 and previous config saved to /var/cache/conftool/dbconfig/20250725-201735-marostegui.json [20:19:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79986 and previous config saved to /var/cache/conftool/dbconfig/20250725-201951-marostegui.json [20:34:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P79987 and previous config saved to /var/cache/conftool/dbconfig/20250725-203458-marostegui.json [20:50:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P79988 and previous config saved to /var/cache/conftool/dbconfig/20250725-205005-marostegui.json [21:05:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T399249)', diff saved to https://phabricator.wikimedia.org/P79989 and previous config saved to /var/cache/conftool/dbconfig/20250725-210513-marostegui.json [21:05:19] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:05:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance [21:05:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T399249)', diff saved to https://phabricator.wikimedia.org/P79990 and previous config saved to /var/cache/conftool/dbconfig/20250725-210536-marostegui.json [21:06:33] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS bookworm [21:07:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.move-vlan for host logstash2036 [21:07:36] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [21:07:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 
(T399249)', diff saved to https://phabricator.wikimedia.org/P79991 and previous config saved to /var/cache/conftool/dbconfig/20250725-210752-marostegui.json [21:13:11] cwhite@cumin2002 reimage (PID 2623349) is awaiting input [21:15:14] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2036 - cwhite@cumin2002" [21:15:19] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2036 - cwhite@cumin2002" [21:15:19] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:15:20] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2036.codfw.wmnet 54.16.192.10.in-addr.arpa 4.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:15:23] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2036.codfw.wmnet 54.16.192.10.in-addr.arpa 4.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:15:24] !log cwhite@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2036 [21:18:26] cwhite@cumin2002 reimage (PID 2623349) is awaiting input [21:20:26] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [21:20:35] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11036055 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host sretest2006.codfw.wmnet with OS bookworm executed with errors: - sretest2006 (... 
[21:23:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P79992 and previous config saved to /var/cache/conftool/dbconfig/20250725-212259-marostegui.json [21:23:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2036 [21:23:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host logstash2036 [21:38:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P79993 and previous config saved to /var/cache/conftool/dbconfig/20250725-213806-marostegui.json [21:41:18] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage [21:44:03] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage [21:53:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T399249)', diff saved to https://phabricator.wikimedia.org/P79994 and previous config saved to /var/cache/conftool/dbconfig/20250725-215314-marostegui.json [21:53:20] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:53:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [22:07:59] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2036.codfw.wmnet with OS bookworm [22:10:37] PROBLEM - Host db2196 #page is DOWN: PING CRITICAL - Packet loss = 100% [22:10:51] looking [22:11:08] acking [22:11:22] o/ [22:11:43] seemingly not pooled? [22:12:19] ah, wait never mind ... 
[22:12:35] last SAL entry was to promote it to x1 primary last month [22:12:36] x1 primary =/ [22:12:52] PROBLEM - MariaDB Replica IO: x1 on db2197 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:12:57] PROBLEM - MariaDB Replica IO: x1 #page on db2231 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:01] PROBLEM - MariaDB Replica IO: x1 #page on db2215 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:09] PROBLEM - MariaDB Replica IO: x1 #page on db2191 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:09] that checks out ^ [22:13:16] egh eyah [22:13:17] *yeah [22:13:23] PROBLEM - MariaDB Replica IO: x1 #page on db2186 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 
maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:24] PROBLEM - MariaDB Replica IO: x1 on db2201 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:28] I think it's "wake up a DBA" territory unfortunately [22:13:39] agreed [22:13:40] 100% [22:13:57] checking codfw application layer to see how sad things are [22:14:02] 12:13 AM for both Manuel and Amir, I guess Amir's probably the one more likely to be awake [22:14:17] oh Amir's on vacation though, Manuel it is [22:14:55] thanks for handling that [22:15:05] mw-web is ... curiously looking alright in codfw [22:15:43] !incidents [22:15:43] 6497 (ACKED) Host db2196 (paged) [22:15:44] 6498 (ACKED) db2231 (paged)/MariaDB Replica IO: x1 (paged) [22:15:44] 6499 (ACKED) db2215 (paged)/MariaDB Replica IO: x1 (paged) [22:15:44] 6500 (ACKED) db2191 (paged)/MariaDB Replica IO: x1 (paged) [22:15:44] 6501 (ACKED) db2186 (paged)/MariaDB Replica IO: x1 (paged) [22:17:13] he's on his way [22:17:28] thank you, rzl! [22:17:57] I think I jinxed things with my handoff, heh [22:18:01] Hey [22:18:53] marostegui: hey -- db2196 is unreachable, x1 codfw master [22:18:56] * urandom is late to the party [22:18:57] marostegui: let me know if you'd like a summary or need additional hands [22:19:28] Anyone tried a manually boot through idrac? 
[22:19:42] I have not [22:19:48] ok I will do that [22:20:16] I'm here too if an additional set of hands will help [22:20:19] PROBLEM - MariaDB Replica Lag: x1 #page on db2215 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:20:23] PROBLEM - MariaDB Replica Lag: x1 #page on db2231 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 653.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:20:27] PROBLEM - MariaDB Replica Lag: x1 #page on db2191 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:20:28] PROBLEM - MariaDB Replica Lag: x1 #page on db2186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:20:35] !incidents [22:20:36] 6497 (ACKED) Host db2196 (paged) [22:20:36] 6498 (ACKED) db2231 (paged)/MariaDB Replica IO: x1 (paged) [22:20:36] 6499 (ACKED) db2215 (paged)/MariaDB Replica IO: x1 (paged) [22:20:36] 6500 (ACKED) db2191 (paged)/MariaDB Replica IO: x1 (paged) [22:20:37] 6501 (ACKED) db2186 (paged)/MariaDB Replica IO: x1 (paged) [22:20:37] 6502 (ACKED) db2215 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:37] 6503 (UNACKED) db2231 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:38] 6504 (UNACKED) db2191 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:38] 6505 (UNACKED) db2186 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:46] !ack 6503 [22:20:47] 6503 (ACKED) db2231 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:48] !ack 6504 [22:20:49] 6504 (ACKED) db2191 (paged)/MariaDB Replica Lag: x1 (paged) [22:20:50] !ack 6505 [22:20:51] 6505 (ACKED) db2186 (paged)/MariaDB Replica Lag: x1 (paged) [22:21:29] it is booting now [22:21:32] let's see [22:21:38] if it finishes [22:22:23] is there any task for this incident? 
[22:22:42] will start one now [22:23:03] ok [22:23:10] shall we create this as an incident? [22:23:13] RECOVERY - Host db2196 #page is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [22:23:22] oh, that looks promising [22:23:32] the host is back, I am going to recover it and then do an emergency switchover [22:23:45] there are no traces of issues on the idrac logs, so I don't trust this host anymore [22:23:46] bare-bones task at https://phabricator.wikimedia.org/T400513 [22:23:49] especially not for the weekend [22:24:07] Is this likely to be why I've had a repeatedly-failing jenkins test ("CentralAuth SharedDomainHookHandlerTest::testOnAuthManagerVerifyAuthentication_multiStep) for the last hour-or-so, or is it probably unrelated? [22:24:23] Kemayo: unrelated unless it started at 22:10 UTC [22:24:37] PROBLEM - MariaDB Replica IO: x1 #page on db2196 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:24:45] !ack 6506 [22:24:45] PROBLEM - MariaDB read only x1 #page on db2196 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:24:45] 6506 (ACKED) db2196 (paged)/MariaDB Replica IO: x1 (paged) [22:25:00] !ack 6507 [22:25:00] 6507 (ACKED) db2196 (paged)/MariaDB read only x1 (paged) [22:25:06] rzl: a bit longer than that, so I suppose not,. 
[22:25:37] RECOVERY - MariaDB Replica IO: x1 #page on db2196 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:47] RECOVERY - MariaDB read only x1 #page on db2196 is OK: Version 10.11.13-MariaDB-log, Uptime 74s, read_only: True, event_scheduler: True, 94.86 QPS, connection latency: 0.029006s, query latency: 0.000852s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:25:52] RECOVERY - MariaDB Replica IO: x1 on db2197 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:57] RECOVERY - MariaDB Replica IO: x1 #page on db2231 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:01] RECOVERY - MariaDB Replica IO: x1 #page on db2215 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:09] RECOVERY - MariaDB Replica IO: x1 #page on db2191 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:19] RECOVERY - MariaDB Replica Lag: x1 #page on db2215 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:23] RECOVERY - MariaDB Replica IO: x1 #page on db2186 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:24] RECOVERY - MariaDB Replica Lag: x1 #page on db2231 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:27] RECOVERY - MariaDB Replica Lag: x1 #page on db2186 is OK: OK slave_sql_lag Replication lag: 0.42 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:28] RECOVERY - MariaDB Replica Lag: x1 #page on db2191 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:28] RECOVERY - MariaDB Replica IO: x1 on db2201 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2215 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1172721 (https://phabricator.wikimedia.org/T400514) [22:27:49] ok, x1 codfw master caught up [22:27:54] running the emergency switchover right now [22:28:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Primary switchover x1 T400514 [22:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2215 with weight 0 T400514', diff saved to https://phabricator.wikimedia.org/P79995 and previous config saved to /var/cache/conftool/dbconfig/20250725-222856-root.json [22:29:01] T400514: Switchover x1 master (db2196 -> db2215) - https://phabricator.wikimedia.org/T400514 [22:29:33] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2215 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1172721 (https://phabricator.wikimedia.org/T400514) (owner: 10Gerrit maintenance bot) [22:35:57] !log Starting x1 codfw failover from db2196 to db2215 - T400514 [22:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:02] T400514: Switchover x1 master (db2196 -> db2215) - https://phabricator.wikimedia.org/T400514 [22:36:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2215 to x1 primary T400514', diff saved to https://phabricator.wikimedia.org/P79996 and previous config saved to /var/cache/conftool/dbconfig/20250725-223622-marostegui.json [22:37:36] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Depool db2196 T400514', diff saved to https://phabricator.wikimedia.org/P79997 and previous config saved to /var/cache/conftool/dbconfig/20250725-223736-marostegui.json [22:39:33] (03PS1) 10Marostegui: db2196: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1172723 (https://phabricator.wikimedia.org/T400513) [22:40:14] (03CR) 10Marostegui: [C:03+2] db2196: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1172723 (https://phabricator.wikimedia.org/T400513) (owner: 10Marostegui) [22:41:38] marostegui: fingers-crossed your weekend ends up better than it started 🤞 [22:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172726 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172726 (owner: 10TrainBranchBot) [23:40:05] (03PS2) 10Dzahn: add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [23:40:18] (03CR) 10CI reject: [V:04-1] add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [23:42:02] (03PS3) 10Dzahn: add blubber builder config to build zoekt [container/codesearch] - 
10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [23:42:15] (03CR) 10CI reject: [V:04-1] add blubber builder config to build zoekt [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [23:50:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172726 (owner: 10TrainBranchBot)