[00:01:01] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [00:08:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1165184 [00:08:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1165184 (owner: 10TrainBranchBot) [00:09:28] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1165186 [00:09:31] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1165187 [00:13:22] (03PS1) 10Clare Ming: xLab: Deploy v0.7.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165189 [00:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:14:33] (03PS1) 10Clare Ming: xLab: Deploy v0.7.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165190 [00:15:44] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165189 (owner: 10Clare Ming) [00:16:06] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165190 (owner: 10Clare Ming) [00:17:15] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165189 (owner: 10Clare Ming) [00:17:42] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165190 (owner: 10Clare Ming) [00:18:18] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [00:19:20] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [00:19:48] !log 
cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [00:20:15] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [00:29:57] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1165184 (owner: 10TrainBranchBot) [01:02:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:08:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.8 [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165198 (https://phabricator.wikimedia.org/T392178) [01:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.8 [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165198 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [01:17:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:19:24] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.8 [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165198 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [02:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0200) [02:28:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:40:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:57:42] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0300) [03:01:42] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165209 (https://phabricator.wikimedia.org/T392178) [03:01:43] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165209 
(https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [03:02:36] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165209 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [03:03:01] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.8 refs T392178 [03:03:07] T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178 [03:43:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:46:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:32] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [03:58:49] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.8 refs T392178 (duration: 55m 48s) [03:58:56] T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178 [04:00:05] Deploy window 
xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0400) [04:01:44] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.5 (duration: 01m 38s) [04:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:24:57] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397983#10961832 (10phaultfinder) [04:36:20] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:38:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:55:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:08] (03CR) 
10Ayounsi: [C:03+1] "I learned more about python than I was able to do a proper code review. What I understood made sense to me, but someone else should review." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [05:43:26] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10961861 (10jcrespo) > @jcrespo hey i finished testing on this server. Do you want to take it for a spin? it's the new 1CPU Config-K Sorry, I have little context of "1CPU Config-K".... [05:44:05] PROBLEM - MariaDB disk space #page on es1036 is CRITICAL: DISK CRITICAL - /run/credentials/systemd-tmpfiles-clean.service is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:44:33] looking [05:45:13] server looks fine [05:45:20] <_joe_> that seems like a red herring [05:45:58] <_joe_> !incidents [05:45:58] 6438 (UNACKED) es1036 (paged)/MariaDB disk space (paged) [05:46:05] <_joe_> !ack 6438 [05:46:06] 6438 (ACKED) es1036 (paged)/MariaDB disk space (paged) [05:46:23] <_joe_> this looks like a UBN! issue with the check for disk space [05:46:50] how does that work? is that even a mount point? [05:46:55] <_joe_> yes [05:47:04] <_joe_> ramfs on /run/credentials/systemd-tmpfiles-clean.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700) [05:47:08] <_joe_> but it's a ramfs [05:47:21] <_joe_> I guess our check is from the distant past and only checks for tmpfs :P [05:47:30] I see, but even df ignores it [05:47:37] <_joe_> yes [05:47:39] <_joe_> for good reason [05:47:51] I wonder why the check doesn't [05:47:58] <_joe_> this looks like a UBN! you can open to o11y [05:48:04] I will [05:48:37] should I retry the check to see if it is just a race condition? 
I don't want to disable the check [05:49:12] I will retry it at least once, I don't see the harm on it [05:50:26] (03PS4) 10Ayounsi: reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 [05:50:38] (03CR) 10Ayounsi: reimage: temporarily store the MAC in Netbox (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [05:50:56] weird that it is only happening on es1036 [05:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:51:42] you can check if there are more important things happening, I will report es1036 [05:51:51] (03PS5) 10Ayounsi: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [05:52:26] RECOVERY - MariaDB disk space #page on es1036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:52:50] oh, so on a retry it got resolved [05:53:45] <_joe_> no I actually have other work to do [05:53:58] yeah, that works too, I meant that I will take care of it [05:54:03] <_joe_> re: check for more important things happening [05:54:17] <_joe_> the mw-api-int thing isn't worrisome per se [05:54:34] <_joe_> it's 99% the same actor as yesterday night [05:56:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:56:58] (03PS6) 10Ayounsi: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [05:57:35] (03CR) 10Ayounsi: reimage: add support for using the host UUID for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [06:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0600) [06:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0600). [06:04:09] (03CR) 10CI reject: [V:04-1] reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:17:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:22:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:30:54] (03PS1) 10Kosta Harlan: UserInfoCard can unintentionally render information for more than one user [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165265 [06:32:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:40:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:22] (03PS3) 10Jgiannelos: mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) [06:55:44] (03CR) 10Arnaudb: [C:03+1] gitlab: remove git_data_dirs setting [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto) [06:55:53] (03CR) 10Jgiannelos: "I bumped the image version on staging and now it should use the profiler as a command." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [06:57:43] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0700) [07:00:05] kart_ and Daniuu: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] here [07:00:23] Present [07:00:32] I can deploy my patch.. [07:00:40] Starting it.. 
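Aside on the es1036 page at 05:44 above: _joe_'s guess was that the disk space check skips tmpfs mounts but not ramfs ones. A minimal sketch of that idea, assuming a check that reads /proc/mounts-style text and filters by filesystem type. The function names, the set of skipped types, and every sample line except the ramfs mount quoted at 05:47:04 are hypothetical; this is not the real check's code.

```python
# Filesystem types that have no meaningful capacity to alert on.
# This set is an assumption for illustration only.
PSEUDO_FS_TYPES = {"tmpfs", "ramfs", "devtmpfs", "proc", "sysfs", "cgroup2"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def mounts_to_check(text, skip_types=PSEUDO_FS_TYPES):
    """Return the mount points a disk space check should look at."""
    return [mp for mp, fs_type in parse_mounts(text) if fs_type not in skip_types]


# Hypothetical sample; only the ramfs entry mirrors the mount quoted in the log.
SAMPLE = """\
/dev/md0 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
ramfs /run/credentials/systemd-tmpfiles-clean.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
/dev/mapper/tank-data /srv xfs rw,noatime 0 0
"""

print(mounts_to_check(SAMPLE))
# A check that only skips tmpfs would still visit the ramfs mount:
print(mounts_to_check(SAMPLE, skip_types={"tmpfs"}))
```

With ramfs in the skip set, the credentials mount is never inspected, which is consistent with the observation that df ignores it as well.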
[07:00:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry) [07:01:47] (03Merged) 10jenkins-bot: Remove cxstats campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164948 (https://phabricator.wikimedia.org/T393705) (owner: 10KartikMistry) [07:02:26] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1164948|Remove cxstats campaign (T393705)]] [07:02:34] T393705: Remove CXStats related code - https://phabricator.wikimedia.org/T393705 [07:04:25] (03PS7) 10Daniuu: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) [07:06:30] !log kartik@deploy1003 kartik: Backport for [[gerrit:1164948|Remove cxstats campaign (T393705)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:09:10] !log kartik@deploy1003 kartik: Continuing with sync [07:10:02] (03CR) 10Vgutierrez: hiera: Use the upload cert on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [07:10:10] (03PS2) 10Vgutierrez: hiera: Use the upload cert on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) [07:12:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [07:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:16:44] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164948|Remove cxstats campaign (T393705)]] (duration: 14m 17s) [07:16:51] T393705: Remove CXStats related code - https://phabricator.wikimedia.org/T393705 [07:17:53] Done [07:18:35] * Daniuu will likely need some assistance from urbanecm or awight or Amir1 [07:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:20:28] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2020 from Ganeti/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1165042 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [07:28:54] PROBLEM - 
Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:29:24] i can deploy Daniuu's patch [07:29:36] (03CR) 10Urbanecm: [C:03+2] nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [07:30:41] (03Merged) 10jenkins-bot: nlwiki: add VRT agent user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165056 (https://phabricator.wikimedia.org/T398216) (owner: 10Daniuu) [07:31:29] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1165056|nlwiki: add VRT agent user group (T398216)]] [07:31:35] T398216: Create VRT user rights group for nl-wp - https://phabricator.wikimedia.org/T398216 [07:33:33] !log urbanecm@deploy1003 urbanecm, daniuu: Backport for [[gerrit:1165056|nlwiki: add VRT agent user group (T398216)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:33:43] Daniuu: can you test your patch at the debug server? [07:34:48] (03CR) 10Vgutierrez: [C:03+2] hiera: Use the upload cert on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1164238 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [07:35:17] Daniuu: to answer your question from Discord (let's please keep all conversation in here), you can access the debugserver by installing https://wikitech.wikimedia.org/wiki/WikimediaDebug extension in your browser. 
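Aside on the WikimediaDebug message above: the browser extension works by attaching an X-Wikimedia-Debug request header that routes the request to a debug backend. A rough sketch of the same idea from a script, assuming the `backend=<name>` header value format; the backend name and the helper function below are placeholders rather than values taken from this log, and nothing is sent over the network.

```python
import urllib.request


def build_debug_request(url, backend):
    """Build (but do not send) a request carrying the debug routing header."""
    request = urllib.request.Request(url)
    # Header value format is an assumption; the backend name is a placeholder.
    request.add_header("X-Wikimedia-Debug", f"backend={backend}")
    return request


req = build_debug_request(
    "https://nl.wikipedia.org/wiki/Special:ListGroupRights",
    "mwdebug.example",  # hypothetical backend name
)
print(req.full_url)
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would actually perform the request; the sketch stops short of that on purpose.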
[07:35:30] please test the patch works as-intended on production nlwiki via the debug server [07:35:38] let me know if i can help you in any way [07:35:46] (fwiw, beta and debug server are two different contexts) [07:36:32] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2050.codfw.wmnet with OS bookworm [07:37:42] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5005.eqsin.wmnet with reason: reimage [07:38:00] urbanecm: looks fine [07:38:04] !log urbanecm@deploy1003 urbanecm, daniuu: Continuing with sync [07:38:05] proceeding [07:38:40] !log vgutierrez@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4045.ulsfo.wmnet [07:40:03] (03PS1) 10Muehlenhoff: Remove access for corvus [puppet] - 10https://gerrit.wikimedia.org/r/1165457 [07:41:22] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs: ntp: Automatically restart the service after config changes [puppet] - 10https://gerrit.wikimedia.org/r/1164970 (https://phabricator.wikimedia.org/T398099) (owner: 10Majavah) [07:41:28] (03CR) 10Slyngshede: [C:03+1] Remove access for corvus [puppet] - 10https://gerrit.wikimedia.org/r/1165457 (owner: 10Muehlenhoff) [07:42:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:43:11] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2004.codfw.wmnet with reason: Maintenance and reboot [07:43:20] !log vgutierrez@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet [07:43:34] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165056|nlwiki: add VRT agent user group (T398216)]] (duration: 12m 
04s) [07:43:43] T398216: Create VRT user rights group for nl-wp - https://phabricator.wikimedia.org/T398216 [07:44:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4045 is CRITICAL: SSL CRITICAL - failed to verify wikipedia.org against upload.wikimedia.org, maps.wikimedia.org:Certificate upload.wikimedia.org SAN wikipedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikipedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN * [07:44:10] edia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikimedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikimedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikimedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.or [07:44:10] icate upload.wikimedia.org SAN *.planet.wikimedia.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN mediawiki.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.mediawiki.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.mediawiki.org not found in cert SAN [07:44:10] ps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikibooks.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikibooks.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikibooks.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN [07:44:10] .org not found in cert SAN list: maps.wikimedia.org, 
upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikidata.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikidata.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikinews.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certi [07:44:10] pload.wikimedia.org SAN *.wikinews.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikinews.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikiquote.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikiquote.org not found in cert SAN list: maps.wikim [07:44:11] , upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikiquote.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikisource.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikisource.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikisource [07:44:11] found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikiversity.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikiversity.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikiversity.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Cert [07:44:12] upload.wikimedia.org SAN wikivoyage.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wikivoyage.org not found in cert SAN list: maps.wikimedia.org, 
upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wikivoyage.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wiktionary.org not found in cert SAN list: maps. [07:44:12] a.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wiktionary.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.m.wiktionary.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wikimediafoundation.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN [07:44:13] ediafoundation.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN wmfusercontent.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN *.wmfusercontent.org not found in cert SAN list: maps.wikimedia.org, upload.wikimedia.org:Certificate upload.wikimedia.org SAN w.wiki not found in cert SAN list: maps.wikimedia.org, upload.wik [07:44:13] rg https://wikitech.wikimedia.org/wiki/HTTPS [07:44:33] Daniuu: should be live [07:44:34] oh wow.. 
that's verbosity [07:44:45] expected side-effect :) [07:44:54] vgutierrez: I thought for a second that I earned myself the famous sticker [07:45:00] I was scared for a second [07:45:49] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:45:57] urbanecm: it is live, thanks for helping out so much 😂 [07:46:31] any time [07:47:51] I'll try to downtime that specific alert on upload@ulsfo to avoid flooding the channel [07:48:36] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2050.codfw.wmnet with reason: host reimage [07:52:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2050.codfw.wmnet with reason: host reimage [07:53:26] (03CR) 10Muehlenhoff: [C:03+2] Remove access for corvus [puppet] - 10https://gerrit.wikimedia.org/r/1165457 (owner: 10Muehlenhoff) [07:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:36] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:54:19] !log switching upload@ulsfo to upload TLS certificate - T394484 [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:24] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [07:55:51] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Corvus out of all services on: 2396 hosts [07:56:25] jmm@cumin2002 reimage (PID 3630477) is awaiting input 
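Aside on the cp4045 check output at 07:44 above: the alert lists every expected SAN that is absent from the certificate's SAN list (maps.wikimedia.org, upload.wikimedia.org), which is why switching to the upload certificate produced such a long message. A minimal sketch of that comparison plus a shorter summary line, assuming plain list membership; the function names and the summary format are hypothetical and do not reproduce the real check.

```python
def missing_sans(expected, cert_sans):
    """Return the expected names that do not appear in the cert SAN list."""
    present = {san.lower() for san in cert_sans}
    return [name for name in expected if name.lower() not in present]


def summarize(missing, limit=3):
    """Collapse a long list of missing names into one short status line."""
    if not missing:
        return "OK"
    shown = ", ".join(missing[:limit])
    extra = len(missing) - limit
    suffix = f" (+{extra} more)" if extra > 0 else ""
    return f"CRITICAL: {len(missing)} SAN(s) missing: {shown}{suffix}"


# SAN list as reported in the alert; the expected list is a short excerpt.
cert_sans = ["maps.wikimedia.org", "upload.wikimedia.org"]
expected = ["wikipedia.org", "*.wikipedia.org", "wikimedia.org", "*.wikimedia.org"]

for name in missing_sans(expected, cert_sans):
    print(f"SAN {name} not found in cert SAN list: {', '.join(cert_sans)}")
print(summarize(missing_sans(expected, cert_sans)))
```

The per-name loop mirrors the wording seen in the channel, while `summarize` shows one way the same result could be reported without flooding it.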
[07:58:35] !log Manually start a Growth cron job via `kubectl create job growthexperiments-deleteoldsurveys-$(date +"%Y%m%d%H%M") --from=cronjobs/growthexperiments-deleteoldsurveys` to verify whether a recent failure is permanent [07:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [08:00:05] jnuche and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0800). [08:00:36] question...why does xLab have such a long deployment window, and what does that mean for deployers? it collides with train, so...hopefully not much? [08:00:49] (03PS1) 10Muehlenhoff: Remove access for schoenbaechler [puppet] - 10https://gerrit.wikimedia.org/r/1165459 [08:00:50] morning, I'll roll out the train in the next few minutes [08:03:56] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165460 (https://phabricator.wikimedia.org/T392178) [08:03:57] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165460 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [08:04:45] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165460 (https://phabricator.wikimedia.org/T392178) (owner: 10TrainBranchBot) [08:06:38] (03CR) 10Jgiannelos: "Indeed I was misled by the response while debugging the issue. I added some more details on the ticket." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: 10Jgiannelos) [08:06:40] (03Abandoned) 10Jgiannelos: mobileapps: Use GET instead of POST for MW API requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165110 (https://phabricator.wikimedia.org/T398167) (owner: 10Jgiannelos) [08:07:30] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2050.codfw.wmnet with OS bookworm [08:08:34] !log installing sudo security updates [08:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bookworm [08:09:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10962061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm [08:10:01] (03PS1) 10Aklapper: Phabricator monthly email: Rename var (now reserved word in MariaDB) [puppet] - 10https://gerrit.wikimedia.org/r/1165461 (https://phabricator.wikimedia.org/T398267) [08:11:40] (03PS1) 10Muehlenhoff: Add ganeti2050 [puppet] - 10https://gerrit.wikimedia.org/r/1165462 (https://phabricator.wikimedia.org/T396590) [08:12:14] (03CR) 10Muehlenhoff: [C:03+2] Fix firewall config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1161827 (owner: 10Muehlenhoff) [08:12:42] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.8 refs T392178 [08:12:49] T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178 [08:12:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:13:52] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti2050 [puppet] - 10https://gerrit.wikimedia.org/r/1165462 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [08:14:00] (03PS2) 10Muehlenhoff: Add ganeti2050 [puppet] - 10https://gerrit.wikimedia.org/r/1165462 (https://phabricator.wikimedia.org/T396590) [08:16:59] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165459 (owner: 10Muehlenhoff) [08:17:02] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti2050 [puppet] - 10https://gerrit.wikimedia.org/r/1165462 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [08:20:56] (03PS1) 10Aklapper: Phabricator: Update recipients of quarterly metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/1165464 [08:25:47] (03CR) 10Vgutierrez: [C:03+2] haproxy: remove conditionals on wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/1152894 (owner: 10Giuseppe Lavagetto) [08:26:02] (03PS1) 10Elukey: pyrra: disable wdqs and istio-related burrates alerts [puppet] - 10https://gerrit.wikimedia.org/r/1165465 [08:26:42] (03PS2) 10Elukey: pyrra: disable wdqs and istio-related burrates alerts [puppet] - 10https://gerrit.wikimedia.org/r/1165465 [08:28:00] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6114/co" [puppet] - 10https://gerrit.wikimedia.org/r/1165465 (owner: 10Elukey) [08:28:53] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:30:17] (03CR) 10Hnowlan: [C:03+1] mobileapps: Use profiler script to spawn profiler 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:30:58] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:32:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [08:32:29] (03Merged) 10jenkins-bot: mobileapps: Use profiler script to spawn profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165068 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:32:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:34:35] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [08:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [08:38:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:39:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10962148 (10Anton.Kokh) {F62759310} Hello, Here's the key. Thank you! 
Anton [08:41:39] (03PS1) 10Elukey: kubernetes: update the prometheus-statsd image to Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1165467 [08:42:22] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2004.codfw.wmnet: Renew puppet certificate - jynus@cumin1002 [08:44:08] (03PS1) 10Jgiannelos: mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) [08:44:16] (03CR) 10CI reject: [V:04-1] mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:44:18] (03PS2) 10Jgiannelos: mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) [08:44:23] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2005.codfw.wmnet with reason: Maintenance and reboot [08:44:26] (03PS1) 10Kosta Harlan: UserInfoCard: Fix opt-in to temporary account label display [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165470 (https://phabricator.wikimedia.org/T395661) [08:44:51] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [08:45:41] jouncebot: nowandnext [08:45:42] For the next 11 hour(s) and 44 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [08:45:42] For the next 1 hour(s) and 14 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T0800) [08:45:42] In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1000) [08:46:08] jnuche: are you done deploying? 
I'd like to backport two patches to wmf.8 [08:46:32] kostajh: yep, you can go ahead [08:48:01] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM. Please keep in mind this will need to be merged during a reserved Mediawiki infrastructure window, and we will have to test everythi" [puppet] - 10https://gerrit.wikimedia.org/r/1165467 (owner: 10Elukey) [08:48:20] thanks [08:48:25] (03CR) 10JMeybohm: [C:03+1] kubernetes: update the prometheus-statsd image to Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1165467 (owner: 10Elukey) [08:49:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165470 (https://phabricator.wikimedia.org/T395661) (owner: 10Kosta Harlan) [08:49:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165265 (owner: 10Kosta Harlan) [08:51:37] (03CR) 10Hnowlan: [C:03+1] mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:52:02] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:53:48] (03Merged) 10jenkins-bot: mobileapps: Fix command/args definition for staging debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165469 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [08:53:58] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [08:54:04] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [08:54:26] (03CR) 10Muehlenhoff: [C:03+2] Remove access for schoenbaechler [puppet] - 
10https://gerrit.wikimedia.org/r/1165459 (owner: 10Muehlenhoff) [08:55:02] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [08:55:29] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [08:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:58:36] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10962185 (10Arnoldokoth) [09:00:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5005.eqsin.wmnet with OS bookworm [09:00:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10962186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm completed: - ganeti5005 (**PASS*... 
[09:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:01:28] (03Merged) 10jenkins-bot: UserInfoCard: Fix opt-in to temporary account label display [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165470 (https://phabricator.wikimedia.org/T395661) (owner: 10Kosta Harlan) [09:01:30] (03Merged) 10jenkins-bot: UserInfoCard can unintentionally render information for more than one user [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165265 (owner: 10Kosta Harlan) [09:01:59] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1165470|UserInfoCard: Fix opt-in to temporary account label display (T395661)]], [[gerrit:1165265|UserInfoCard can unintentionally render information for more than one user]] [09:02:05] T395661: UserInfoCard: Indicate if a user has enabled the preference to view temporary account IPs - https://phabricator.wikimedia.org/T395661 [09:02:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 270721904 and 53 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:03:29] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 174168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:04:04] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1165470|UserInfoCard: Fix opt-in to temporary account label display (T395661)]], [[gerrit:1165265|UserInfoCard can unintentionally render information for more than one user]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:05:44] !log kharlan@deploy1003 kharlan: Continuing with sync [09:08:30] (03PS2) 10Jelto: gitlab: remove git_data_dirs setting [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) [09:11:14] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165470|UserInfoCard: Fix opt-in to temporary account label display (T395661)]], [[gerrit:1165265|UserInfoCard can unintentionally render information for more than one user]] (duration: 09m 15s) [09:11:22] T395661: UserInfoCard: Indicate if a user has enabled the preference to view temporary account IPs - https://phabricator.wikimedia.org/T395661 [09:12:01] ok, all done. thanks [09:15:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:16:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:16:43] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:17:18] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for 
backup2005.codfw.wmnet: Renew puppet certificate - jynus@cumin1002 [09:17:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:18:43] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:19:01] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:21:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:21:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:21:44] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2006.codfw.wmnet with reason: Maintenance and reboot [09:22:20] (03CR) 10Hashar: [C:04-1] "We track the upstream plugin as submodules of fork of Gerrit (branch `wmf/stable-3.10`) https://gerrit.wikimedia.org/r/plugins/gitiles/ope" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1164044 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:25:34] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [09:25:37] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [09:25:44] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in 
bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1165474 [09:26:47] (03PS1) 10Hnowlan: api-gateway: use more recent ratelimit image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) [09:27:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [09:28:33] (03CR) 10Hashar: [C:03+2] "Sorry I have missed the updates last week. Thank you for the testing @ebomani@wikimedia.org! I am deploying the change right now 😎" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi) [09:29:16] (03Merged) 10jenkins-bot: Remove all references to patchdemo legacy [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi) [09:29:57] (03CR) 10Muehlenhoff: [C:03+2] Update server entry for idp-test in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1161828 (owner: 10Muehlenhoff) [09:30:06] 06SRE, 10MediaWiki-Uploading, 06serviceops: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10962227 (10Grand-Duc) Info: the issue is still there. I tried to upload a new version of [[ https://commons.wikimedia.org/wiki/File:M%C3%BChle_O... 
[09:32:55] !log hashar@deploy1003 Started deploy [gerrit/gerrit@4e671a0]: Remove all references to patchdemo legacy - T391866 [09:33:02] T391866: Decommission patchdemo-legacy.wmcloud.org - https://phabricator.wikimedia.org/T391866 [09:33:07] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@4e671a0]: Remove all references to patchdemo legacy - T391866 (duration: 00m 12s) [09:34:22] (03PS5) 10Jelto: gitlab: remove git_data_dirs setting [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) [09:34:22] (03CR) 10Jelto: [V:03+1] "I had to move the `gitaly['configuration']` out of the exporter section to make sure this setting is also applied for instances without mo" [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto) [09:37:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [09:41:49] (03PS1) 10Elukey: docker_registry::web: allow prod-build to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/1165477 (https://phabricator.wikimedia.org/T397696) [09:45:01] (03PS2) 10Cyndywikime: Growth: Configure higher impact module edit limits for english and test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) [09:45:23] (03CR) 10Cyndywikime: "Thanks Michael :).This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [09:49:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [09:49:38] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1165474 (owner: 10Muehlenhoff) [09:50:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host 
ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [09:50:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:50:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti5006.eqsin.wmnet [09:51:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:51:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10962325 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs [09:51:35] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1165474 (owner: 10Muehlenhoff) [09:52:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [09:52:46] (03CR) 10Clément Goubert: [C:03+1] "LGTM, we can merge it in the upcoming MW infra window" [puppet] - 10https://gerrit.wikimedia.org/r/1165467 (owner: 10Elukey) [09:53:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 17.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:53:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10962329 (10ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs [09:57:12] (03CR) 10Michael Große: [C:03+1] Growth: Configure higher impact module edit limits for 
english and test wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [09:57:17] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2006.codfw.wmnet: Renew puppet certificate - jynus@cumin1002 [09:57:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [09:59:06] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2007.codfw.wmnet with reason: Maintenance and reboot [10:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1000) [10:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:03:16] elukey: ready to merge the statsd change? [10:03:55] claime: sure! 
I have no idea what to check though [10:04:27] (03CR) 10Muehlenhoff: [C:03+2] acmechief: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161829 (owner: 10Muehlenhoff) [10:05:30] elukey: I guess just using thanos to check statsd_exporter_lines_total climbs ok [10:05:46] (03CR) 10Elukey: [C:03+2] kubernetes: update the prometheus-statsd image to Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1165467 (owner: 10Elukey) [10:06:10] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp-test2004.wikimedia.org [10:07:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool pc3 T378715', diff saved to https://phabricator.wikimedia.org/P78729 and previous config saved to /var/cache/conftool/dbconfig/20250701-100729-ladsgroup.json [10:07:36] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [10:08:01] elukey: https://grafana.wikimedia.org/goto/-khxTnsNg?orgId=1 [10:08:34] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Switch to 10G (T378715) [10:08:39] running puppet on deploy1003 [10:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:11:00] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [10:12:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [10:15:38] claime: deploy1003 ready [10:15:58] elukey: ok give me a second [10:16:26] we should see changes better here https://grafana.wikimedia.org/goto/zB0kA7yHg?orgId=1 [10:16:33] jmm@cumin1003 
decommission (PID 3989130) is awaiting input [10:16:56] yep I was about to propose something similar, +1 [10:17:30] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:17:41] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:17:46] !! 1+ million lines/s [10:17:49] gg mediawiki [10:17:53] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:18:45] elukey: errr [10:18:48] cat statsd-global.yaml [10:18:54] exporter: prometheus-statsd-exporter:0.26.1-2-20240804 [10:18:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:18:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:18:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2004.wikimedia.org [10:19:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [10:19:28] (03CR) 10Muehlenhoff: [C:03+2] site.pp: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161830 (owner: 10Muehlenhoff) [10:19:54] claime: lovely [10:19:57] elukey: so that change would actually only apply for non-mw [10:20:19] mw is already on bookworm for statsd-exporter afaict? [10:20:33] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10962442 (10Vgutierrez) chassis power status seems to be deprecated, `get System.ServerPwr` doesn't provide specific information about the status but `serveraction power... [10:21:10] or was the former tag still buster? 
[10:21:11] claime: at this point yes, I may have got it right the first time but forgot to update the default [10:21:34] oh cool so there's actually nothing to do for mw :D [10:21:40] so the tag that you highlighted is bookworm and correct, the -2 is placed correctly [10:21:53] yep yep sorry for the confusion [10:21:58] no worries [10:22:55] godog: ackchually almost 3M lps B-D [10:23:11] between 3 and 3.5 [10:23:15] hah indeed, notbad.gif [10:23:22] get metric'd [10:23:35] lolz [10:24:01] no wonder we were saturating graphite's NIC with udp traffic -.- [10:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:24:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:26:20] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2050.codfw.wmnet to cluster codfw and group B [10:27:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2050.codfw.wmnet to cluster codfw and group B [10:28:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10962497 (10MoritzMuehlenhoff) [10:29:37] claime: going afk, IIUC all good right? 
[10:29:44] thanks again [10:29:51] elukey: yeah, looks all right for" [10:29:56] me* [10:29:58] <3 [10:32:50] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1006.eqiad.wmnet [10:33:19] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2007.codfw.wmnet: Renew puppet certificate - jynus@cumin1002 [10:37:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good given the build hosts are root-only hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1165477 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:38:16] (03CR) 10Effie Mouzeli: [C:03+1] Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [10:39:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1006.eqiad.wmnet [10:40:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:43:28] (03PS1) 10Hnowlan: mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165483 (https://phabricator.wikimedia.org/T397750) [10:44:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:45:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.74% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:45:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [10:48:18] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10962540 (10Vgutierrez) currently spicerack IPMI module uses `chassis power status` as part of its [[ https://doc.wikimedia.org/spicerack/master/_modules/spicerack/ipmi.... [10:50:38] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2001.codfw.wmnet [10:54:42] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [10:57:43] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:27] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2001.codfw.wmnet [10:58:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10962577 (10Clement_Goubert) >>! In T398075#10961146, @Htriedman wrote: > Hi @Clement_Goubert! When I navigate to the L3 document... 
[11:00:59] (03PS5) 10Effie Mouzeli: trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) [11:01:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [11:01:33] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2005.codfw.wmnet [11:02:08] (03PS1) 10Clément Goubert: admin::data: Update access for htriedman [puppet] - 10https://gerrit.wikimedia.org/r/1165485 (https://phabricator.wikimedia.org/T398075) [11:02:16] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10962584 (10Clement_Goubert) [11:02:28] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:43] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10962585 (10Vgutierrez) @Jhancock.wm could you try re-running the cookbook again and see if we get more progress now? 
thanks [11:05:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165152 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [11:06:07] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: remove git_data_dirs setting [puppet] - 10https://gerrit.wikimedia.org/r/1165033 (https://phabricator.wikimedia.org/T394382) (owner: 10Jelto) [11:08:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2005.codfw.wmnet [11:10:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:17:57] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165486 [11:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:19:36] (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1165486 (owner: 10Muehlenhoff) [11:23:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10962642 (10Clement_Goubert) >>! In T395917#10962148, @Anton.Kokh wrote: >... [11:23:06] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10962643 (10Clement_Goubert) 05Stalled→03In progress [11:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 24.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:27:34] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T398297 (10Zaid007) 03NEW [11:29:00] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T398297#10962667 (10Zaid007) a:03Zaid007 [11:29:34] (03PS1) 10Muehlenhoff: Temporarily drop puppetserver1002/2002 for maintenance [dns] - 10https://gerrit.wikimedia.org/r/1165489 [11:30:21] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [11:30:31] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [11:33:41] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10962691 (10MoritzMuehlenhoff) [11:35:49] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2002.codfw.wmnet [11:36:00] (03CR) 
10Muehlenhoff: [C:03+2] Temporarily drop puppetserver1002/2002 for maintenance [dns] - 10https://gerrit.wikimedia.org/r/1165489 (owner: 10Muehlenhoff) [11:36:06] !log jmm@dns1004 START - running authdns-update [11:37:07] !log jmm@dns1004 END - running authdns-update [11:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 20.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:38:06] (03CR) 10Clément Goubert: [C:03+1] mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165483 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:43:13] (03CR) 10Hnowlan: [C:03+2] mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165483 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:43:33] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2002.codfw.wmnet [11:44:53] (03Merged) 10jenkins-bot: mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165483 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent 
[11:45:48] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:45:54] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2003.codfw.wmnet [11:46:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:47:36] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:47:47] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:49:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10962776 (10brouberol) @Clement_Goubert: @Htriedman should have access to all airflow instances as part of... [11:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 21.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:51:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10962811 (10Clement_Goubert) >>! In T398075#10962776, @brouberol wrote: > @Clement_Goubert: @Htriedman shou... 
[11:52:14] (03PS70) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [11:53:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1165485 (https://phabricator.wikimedia.org/T398075) (owner: 10Clément Goubert) [11:53:16] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2003.codfw.wmnet [11:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:36] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:53:52] (03PS71) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [11:54:21] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2004.codfw.wmnet [11:56:06] (03PS1) 10Effie Mouzeli: dsh: remove testservers from scap destinations [puppet] - 10https://gerrit.wikimedia.org/r/1165492 (https://phabricator.wikimedia.org/T397498) [11:56:32] (03CR) 10CI reject: [V:04-1] dsh: remove testservers from scap destinations [puppet] - 10https://gerrit.wikimedia.org/r/1165492 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [11:56:55] (03PS1) 10Filippo Giunchedi: thanos: start sampled traces from query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1165493 (https://phabricator.wikimedia.org/T394414) [11:56:57] (03PS1) 10Filippo Giunchedi: thanos: notify services on tracing changes [puppet] - 
10https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) [11:57:43] (03CR) 10Cathal Mooney: sre.dns.netbox-future cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [11:57:46] (03PS2) 10Effie Mouzeli: dsh: remove testservers from scap destinations [puppet] - 10https://gerrit.wikimedia.org/r/1165492 (https://phabricator.wikimedia.org/T397498) [11:58:03] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10962850 (10BTullis) I have confirmed that membership of `analytics-private... [11:59:33] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2004.codfw.wmnet [12:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1200) [12:00:09] !log manually clean out external_cloud_vendors directory on puppet 5 frontends to fix Puppet runs [12:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:50] (03CR) 10CI reject: [V:04-1] sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [12:01:35] (03PS72) 10Cathal Mooney: sre.dns.netbox-future cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) [12:02:33] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2005.codfw.wmnet [12:03:21] (03CR) 10Clément Goubert: [C:03+2] admin::data: Update access for htriedman [puppet] - 10https://gerrit.wikimedia.org/r/1165485 
(https://phabricator.wikimedia.org/T398075) (owner: 10Clément Goubert) [12:07:47] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2005.codfw.wmnet [12:08:22] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet [12:13:20] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1001.eqiad.wmnet [12:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:15:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet [12:16:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [12:16:39] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165496 [12:18:03] (03PS1) 10Jgiannelos: mobileapps: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165497 [12:19:49] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165497 (owner: 10Jgiannelos) [12:20:16] (03PS1) 10Muehlenhoff: Readd puppetserver2002 [dns] - 10https://gerrit.wikimedia.org/r/1165498 [12:20:56] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1001.eqiad.wmnet [12:21:04] !log installing libcap2 security updates [12:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:37] (03Merged) 10jenkins-bot: mobileapps: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165497 (owner: 10Jgiannelos) [12:21:45] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1002.eqiad.wmnet [12:22:05] (03CR) 10Muehlenhoff: [C:03+2] Readd 
puppetserver2002 [dns] - 10https://gerrit.wikimedia.org/r/1165498 (owner: 10Muehlenhoff) [12:22:12] !log jmm@dns1004 START - running authdns-update [12:22:33] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:23:14] !log jmm@dns1004 END - running authdns-update [12:24:36] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164281 (owner: 10PipelineBot) [12:26:07] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164281 (owner: 10PipelineBot) [12:28:29] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#10963011 (10elukey) p:05Triage→03High [12:28:50] 10SRE-SLO: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10963012 (10elukey) 05Open→03Resolved a:03elukey [12:29:03] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10963015 (10Jclark-ctr) @Jhancock.wm per our conversations on irc yesterday i believe that should be setup under this partman - partman/custom/boss_leavelvm.cfg [12:29:12] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1002.eqiad.wmnet [12:29:37] 10SRE-SLO: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10963018 (10elukey) The page got to its last version and it is now part of the official template, thanks all for the inputs and comments. I've also followed up... 
[12:30:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:05] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:31:24] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:31:48] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10963026 (10elukey) @DLynch that's great! Next steps: * Wait for the following metri... [12:31:53] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1003.eqiad.wmnet [12:32:17] !log root@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet [12:32:24] 06SRE, 10SRE-SLO, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10963032 (10elukey) p:05Triage→03Medium [12:32:31] 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q4): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10963034 (10elukey) p:05Triage→03Medium [12:32:40] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:33:16] 06SRE, 10SRE-SLO, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10963052 (10elukey) p:05Triage→03Medium [12:33:40] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:33:43] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - 
https://phabricator.wikimedia.org/T394057#10963054 (10elukey) [12:34:11] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:34:37] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 337625328 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frnetmon1001 - https://phabricator.wikimedia.org/T398079#10963058 (10Jclark-ctr) [12:34:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frnetmon1001 - https://phabricator.wikimedia.org/T398079#10963059 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:34:45] (03PS1) 10Jgiannelos: mobileapps: Disable clustering on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165501 [12:34:57] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:35:12] (03CR) 10Hnowlan: [C:03+1] mobileapps: Disable clustering on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165501 (owner: 10Jgiannelos) [12:35:24] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:35:37] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 960 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:38] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Disable clustering on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165501 (owner: 10Jgiannelos) [12:37:30] (03Merged) 10jenkins-bot: mobileapps: Disable clustering on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165501 (owner: 10Jgiannelos) [12:38:11] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:38:32] !log jgiannelos@deploy1003 helmfile [staging] DONE 
helmfile.d/services/mobileapps: apply [12:38:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:46] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet [12:39:18] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1003.eqiad.wmnet [12:39:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10963110 (10Jclark-ctr) [12:39:44] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1004.eqiad.wmnet [12:39:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1002 and its disk array - https://phabricator.wikimedia.org/T398210#10963115 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:40:25] FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:39] 10SRE-SLO, 10observability: Add links in the Pyrra rolling dashboards to point to their calendar ones in Grafana - https://phabricator.wikimedia.org/T398311 (10elukey) 03NEW [12:41:42] 10SRE-SLO, 10observability: Add links in the Pyrra rolling dashboards to point to their calendar ones in Grafana - https://phabricator.wikimedia.org/T398311#10963183 (10elukey) p:05Triage→03Medium [12:42:24] (03PS4) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) [12:43:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 
06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10963212 (10WMDE-leszek) Hi @BTullis, thanks, I appreciate thorough checkin... [12:43:48] (03CR) 10Elukey: "I had a chat with Joe, the prod-build credentials are also deployed on CI nodes, and he is worried that we sensitive material could be pul" [puppet] - 10https://gerrit.wikimedia.org/r/1165477 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:45:25] FIRING: [3x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:57] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1004.eqiad.wmnet [12:46:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:47:44] (03CR) 10Tiziano Fogli: [C:03+1] pyrra: disable wdqs and istio-related burrates alerts [puppet] - 10https://gerrit.wikimedia.org/r/1165465 (owner: 10Elukey) [12:48:25] 10SRE-SLO, 10observability: Add a banner to slo.wikimedia.org explaining rolling vs calendar views - https://phabricator.wikimedia.org/T398313 (10elukey) 03NEW p:05Triage→03Medium [12:48:38] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [12:49:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [12:49:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T397983#10963267 (10phaultfinder) [12:49:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-be100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T397414#10963268 (10Jclark-ctr) [12:50:21] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:25] FIRING: [3x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:31] 06SRE, 10SRE-SLO, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796#10963271 (10elukey) 05Open→03Resolved a:03elukey After some chats in the SLO working group, we decided to keep the rolling window and the... 
[12:50:51] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [12:50:58] 10ops-eqiad, 06DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315 (10phaultfinder) 03NEW [12:51:05] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:52:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-be100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T397414#10963283 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:52:10] (03PS1) 10Muehlenhoff: Readd puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1165508 [12:53:03] 06SRE, 10SRE-SLO, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10963285 (10elukey) The dashboard has been renamed and restructured, now it looks like this: https://grafana.wikimedia.org/d/YuUMRZ44z/slo-quarterly-review We should probably make it a littl... 
[12:53:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:25] FIRING: [3x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:29] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5006.eqsin.wmnet with reason: reimage [12:58:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5006.eqsin.wmnet with OS bookworm [12:59:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10963341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm [12:59:35] !log setup BGP to Paylb on pfw1-eqiad - T397865 [12:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:48] T397865: network and DNS configuration for new eqiad frack pay-lb servers - https://phabricator.wikimedia.org/T397865 [13:00:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [13:00:04] Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1300). [13:00:04] MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:25] RESOLVED: [3x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:29] * MichaelG_WMF is here [13:03:36] (03PS3) 10Elukey: pyrra: disable wdqs and istio-related burrates alerts [puppet] - 10https://gerrit.wikimedia.org/r/1165465 [13:03:36] (03PS2) 10Elukey: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:04:19] (03CR) 10CI reject: [V:04-1] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:04:25] (03CR) 10Elukey: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:54] (03PS3) 10Elukey: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:06:45] (03PS1) 10Vgutierrez: hiera: Switch lvs7003 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1165516 (https://phabricator.wikimedia.org/T396561) [13:07:27] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165516 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:07:41] (03PS16) 10JHathaway: dhcp: add a UUID based DHCP 
config [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 [13:07:48] (03CR) 10Elukey: [C:03+2] pyrra: disable wdqs and istio-related burrates alerts [puppet] - 10https://gerrit.wikimedia.org/r/1165465 (owner: 10Elukey) [13:07:55] (03CR) 10JHathaway: dhcp: add a UUID based DHCP config (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164316 (owner: 10JHathaway) [13:08:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [13:08:52] MichaelG_WMF: let's move it out [13:09:09] 🙌 [13:09:34] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:59] (03Merged) 10jenkins-bot: Growth: Configure higher impact module edit limits for english and test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164979 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [13:10:25] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1164979|Growth: Configure higher impact module edit limits for english and test wiki (T341599)]] [13:10:32] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:10:44] ...that should've been in ext-GE.php, not in IS [13:10:53] too late :/ [13:11:12] (03CR) 10Muehlenhoff: [C:03+2] Readd puppetserver1002 [dns] - 10https://gerrit.wikimedia.org/r/1165508 (owner: 10Muehlenhoff) [13:11:17] !log 
jmm@dns1004 START - running authdns-update [13:12:19] !log jmm@dns1004 END - running authdns-update [13:12:22] MichaelG_WMF: this cannot really be tested, right? [13:12:24] (not on mwdebug) [13:12:43] mh, why not? [13:12:52] urbanecm: could you ping me once you are done? [13:12:53] is it on mwdebug? [13:12:54] (03PS1) 10Jgreen: Add monitoring for pay-lb100[12] to nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1165519 (https://phabricator.wikimedia.org/T397865) [13:13:00] zabe: sure [13:13:00] !log urbanecm@deploy1003 urbanecm, cyndywikime: Backport for [[gerrit:1164979|Growth: Configure higher impact module edit limits for english and test wiki (T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:13:07] thx [13:13:26] MichaelG_WMF: doesn't it affect the refresh job only? [13:13:41] so mwdebug triggers a job, a regular job runner picks it => old value [13:13:51] but maybe i'm missing something [13:13:56] MichaelG_WMF: anyway, it is on mwdebug now, feel free to test [13:14:07] 🤔 - I would have assumed that it also affects the on-demand calculations of edits [13:14:19] but maybe I missed something. 
I'll have a look [13:14:25] hmm, it probably would [13:14:28] let's try, we'll see :) [13:14:48] (03PS2) 10Jgreen: Add monitoring for pay-lb100[12] to nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1165519 (https://phabricator.wikimedia.org/T397865) [13:15:59] (03PS1) 10Jelto: miscweb: bump first three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165520 (https://phabricator.wikimedia.org/T398303) [13:16:25] (03CR) 10Jgreen: [V:03+2 C:03+1] Add monitoring for pay-lb100[12] to nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1165519 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen) [13:17:37] but now I have opened too many people's Impacts and gotten myself into the rate limit and am getting 429 errors 🤦 [13:17:48] At least I'm not seeing any errors [13:18:03] MichaelG_WMF: should i reset that for you? [13:18:25] urbanecm: if that is this easy? [13:18:28] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1206:9290 - https://phabricator.wikimedia.org/T397978#10963435 (10Jclark-ctr) a:03Jclark-ctr Psu has Failed Opened dell ticket for replacement [13:18:33] MichaelG_WMF: on your staff account? [13:18:54] yep! [13:19:06] and testwiki? or enwiki? [13:19:12] (not sure if that limit is per account or per ip) [13:19:28] per user+wiki [13:19:28] enwiki, I haven't found users with enough edits on testwiki yet [13:20:50] At least, I can confirm that the new limit already exists in the i18n copy^^ [13:21:05] (03CR) 10Arnaudb: [C:03+1] "looks good to me, question inline!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165520 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [13:21:25] MichaelG_WMF: try now? 
[13:21:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [13:22:24] (03PS1) 10Elukey: profile::pyrra::filesystem::slo: fix WDQS SLI [puppet] - 10https://gerrit.wikimedia.org/r/1165521 (https://phabricator.wikimedia.org/T393966) [13:22:24] urbanecm: it worked, and I can see the higher numbers! [13:22:42] perfect! [13:22:48] (03CR) 10Gmodena: Revert^2 "Clean up EventBus and jobs config" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [13:22:48] so, we're going ahead, right? [13:23:27] MichaelG_WMF: ^ [13:23:44] urbanecm: Yes, we can roll forward! [13:23:47] !log urbanecm@deploy1003 urbanecm, cyndywikime: Continuing with sync [13:23:51] thanks for confirming, proceeding [13:24:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [13:24:38] (03CR) 10Tiziano Fogli: [C:03+1] thanos: notify services on tracing changes [puppet] - 10https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: 10Filippo Giunchedi) [13:29:36] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164979|Growth: Configure higher impact module edit limits for english and test wiki (T341599)]] (duration: 19m 10s) [13:29:42] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:29:43] MichaelG_WMF: live [13:29:45] anything else? [13:30:34] (03CR) 10Jelto: [C:03+2] miscweb: bump first three miscweb images to bookworm (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165520 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [13:30:40] urbanecm: not from my side, thank you! [13:30:46] sounds good! 
[13:30:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10963492 (10MoritzMuehlenhoff) [13:31:54] (That is, I think I would like to have [fix(AddALink): adjust notification copy and icon](https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1165482) backported to -wmf.8, but maybe not in this window) [13:32:35] (03Merged) 10jenkins-bot: miscweb: bump first three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165520 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [13:32:35] (03CR) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [13:35:20] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [13:35:21] (03PS1) 10Elukey: pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) [13:36:34] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:36:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:37:21] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:39:17] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:39:59] (03CR) 10Zabe: [C:03+2] categorylinks: Set testwiki to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164472 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:40:19] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:40:24] (03CR) 10Urbanecm: Revert^2 "Clean up EventBus and jobs config" (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [13:40:48] (03Merged) 10jenkins-bot: categorylinks: Set testwiki to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164472 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:41:19] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1164472|categorylinks: Set testwiki to read new (T397912)]] [13:41:25] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [13:41:42] zabe: oh, sorry, i forgot to ping you [13:42:19] no worries [13:42:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:43:40] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:43:49] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:44:08] !log zabe@deploy1003 zabe: Backport for [[gerrit:1164472|categorylinks: Set testwiki to read new (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:45:34] !log zabe@deploy1003 zabe: Continuing with sync [13:46:20] (03CR) 10Filippo Giunchedi: [C:03+2] Add monitoring for pay-lb100[12] to nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1165519 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen) [13:46:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:47:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:48:09] (03CR) 10Ladsgroup: Revert^2 "Clean up EventBus and jobs config" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [13:48:42] (03PS1) 10D3r1ck01: Update email for shell user "derick" [puppet] - 10https://gerrit.wikimedia.org/r/1165526 [13:50:43] (03CR) 10Urbanecm: Revert^2 "Clean up EventBus and jobs config" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup) [13:51:03] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164472|categorylinks: Set testwiki to read new (T397912)]] (duration: 09m 44s) [13:51:10] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [13:51:31] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [13:52:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5006.eqsin.wmnet with OS bookworm [13:52:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - 
https://phabricator.wikimedia.org/T382512#10963584 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5006.eqsin.wmnet with OS bookworm completed: - ganeti5006 (**PASS*... [13:53:07] (03Abandoned) 10Arnaudb: gerrit: add readonly.jar [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1164044 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:54:56] (03CR) 10Majavah: [C:03+1] Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - 10https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [13:55:37] (03CR) 10Majavah: Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [13:57:44] (03PS1) 10Cwhite: logstash: reroute mobileapps webrequests [puppet] - 10https://gerrit.wikimedia.org/r/1165532 (https://phabricator.wikimedia.org/T390215) [13:57:58] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10963614 (10Jgreen) [14:00:27] (03Abandoned) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:03:14] (03CR) 10Cwhite: [C:03+2] logstash: reroute mobileapps webrequests [puppet] - 10https://gerrit.wikimedia.org/r/1165532 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [14:05:39] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10963632 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:08:25] (03PS1) 10Joal: Add accept_language to webrequest_sampled turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1165537 [14:08:32] RECOVERY - Check 
unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:08:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:10:11] msg _joe_ Heya, I have merged the accept_language patch and updated the druid datasource. We'll continue to talk about data size on the ticket. [14:10:24] Now I have a patch for turnilo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1165537 [14:11:08] Ah! those messages were supposed to be sent to _joe_ only... I'll verify my syntax next time [14:11:17] <_joe_> ahah :) [14:11:37] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10963659 (10Jhancock.wm) decom is completed. 
additional cable for cloudcephosd2007-dev is connected to port 23 [14:12:23] (03CR) 10Giuseppe Lavagetto: [C:03+1] Add accept_language to webrequest_sampled turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1165537 (owner: 10Joal) [14:15:57] (03PS3) 10Majavah: P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - 10https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480) [14:16:22] (03PS1) 10Brouberol: deployment_server: group chown all airflow private files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) [14:16:49] (03CR) 10CI reject: [V:04-1] deployment_server: group chown all airflow private files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [14:17:42] (03PS1) 10Elukey: pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) [14:18:16] (03CR) 10Btullis: [C:03+2] Add accept_language to webrequest_sampled turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1165537 (owner: 10Joal) [14:18:20] (03CR) 10CI reject: [V:04-1] pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:18:39] (03PS1) 10Zabe: categorylinks: Set group0 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165540 (https://phabricator.wikimedia.org/T397912) [14:19:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:19:46] (03PS2) 10Elukey: pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 
(https://phabricator.wikimedia.org/T391852) [14:20:23] (03CR) 10CI reject: [V:04-1] pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:20:26] (03PS2) 10Brouberol: deployment_server: group chown all airflow private files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) [14:20:51] (03CR) 10David Caro: [C:03+1] P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - 10https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480) (owner: 10Majavah) [14:21:03] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - 10https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480) (owner: 10Majavah) [14:21:49] (03CR) 10Cathal Mooney: [C:03+1] P:bird and C:bird::anycast: support exporting Prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:22:28] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:35] (03CR) 10Cathal Mooney: [C:03+1] "LGTM yeah. We can always revert if it skews the balance too much." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1164972 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [14:23:13] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2021 / ganeti2022 - https://phabricator.wikimedia.org/T398182#10963754 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:23:25] (03CR) 10Scott French: [C:03+2] aptrepo: add php83 component and pcre2 updates [puppet] - 10https://gerrit.wikimedia.org/r/1165151 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [14:23:30] (03CR) 10Ayounsi: [C:03+2] Remove some Arelion/NTT traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/1164972 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [14:24:15] (03Merged) 10jenkins-bot: Remove some Arelion/NTT traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/1164972 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [14:24:43] (03PS3) 10Brouberol: deployment_server: group chown all airflow private files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) [14:25:03] (03CR) 10Cathal Mooney: "Great work! Overall LGTM thanks :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161448 (owner: 10Effie Mouzeli) [14:25:28] (03CR) 10Scott French: [C:03+2] package_builder: add pbuilder hook for component/php83 [puppet] - 10https://gerrit.wikimedia.org/r/1165152 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [14:25:41] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [14:25:57] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [14:26:06] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [14:28:39] (03CR) 10Brouberol: [C:03+2] deployment_server: group chown all airflow private files to airflow-deployers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165538 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [14:29:10] FIRING: BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:29:43] (03PS3) 10Elukey: pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [14:30:09] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: 10Filippo Giunchedi) [14:31:00] (03CR) 10Kosta Harlan: hcaptcha: initial commit for proxy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:31:12] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup2002 and its disk array - https://phabricator.wikimedia.org/T398212#10963797 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:33:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:34:10] RESOLVED: BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:35:25] (03PS5) 10JMeybohm: sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) [14:35:40] (03PS5) 10Jcrespo: bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - 10https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) [14:36:40] (03CR) 10Muehlenhoff: [C:03+1] "This looks good (to the extent that Partman recipes can look good)" [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: 10Fabfur) [14:37:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [14:38:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:40:54] (03PS1) 10Jelto: miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165543 (https://phabricator.wikimedia.org/T398303) [14:43:29] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10963938 (10MoritzMuehlenhoff) [14:43:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:46:19] (03CR) 10Arnaudb: [C:03+1] miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165543 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [14:46:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1 [14:47:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [14:49:50] jmm@cumin2002 addnode (PID 3734260) is awaiting input [14:51:26] (03PS1) 10Elukey: pyrra: add tonecheck Pyrra config [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) [14:52:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1 [14:52:56] (03CR) 10Jelto: [C:03+2] miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165543 
(https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [14:53:04] (03PS1) 10Bernard Wang: Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 [14:53:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [14:53:59] (03CR) 10CI reject: [V:04-1] Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [14:54:06] !log failover Ganeti master in eqsin to ganeti5004 [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:05] (03Merged) 10jenkins-bot: miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165543 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [14:55:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:55:44] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:56:38] PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:57:14] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [15:00:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [15:00:05] Deploy window xLab Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [15:00:05] jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1500). [15:00:11] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:01:00] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:02:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10964114 (10MoritzMuehlenhoff) [15:02:42] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:02:51] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.wipe-cluster: Downtime services [cookbooks] - 10https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [15:02:57] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:03:08] (03CR) 10Muehlenhoff: [C:03+2] Switch mc-wf1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1161508 (owner: 10Muehlenhoff) [15:04:34] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:06:31] (03PS1) 10Elukey: services: configure tegola in codfw to use maps-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) [15:06:32] (03PS1) 10Elukey: services: move kartotherian codfw to the maps-test postgres cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165551 (https://phabricator.wikimedia.org/T381565) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:38] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [15:08:14] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [15:08:46] !log brennen@deploy1003 Started deploy [phabricator/deployment@311587a]: deploy phab2002 for T398328 [15:08:49] T398328: Deploy Phabricator/Phorge 2025-07-01 - https://phabricator.wikimedia.org/T398328 [15:09:06] (03CR) 10Elukey: services: configure tegola in codfw to use maps-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:09:26] !log brennen@deploy1003 Finished deploy [phabricator/deployment@311587a]: deploy phab2002 for T398328 (duration: 00m 41s) [15:09:46] !log brennen@deploy1003 Started deploy [phabricator/deployment@311587a]: deploy phab1004 for T398328 [15:10:04] (03CR) 10Majavah: [C:03+1] keystone policy: allow object_storage role to create/delete ec2 creds [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [15:10:24] !log brennen@deploy1003 Finished deploy [phabricator/deployment@311587a]: deploy phab1004 for T398328 (duration: 00m 37s) [15:12:12] (03CR) 10LorenMora: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [15:13:12] (03CR) 10Ilias Sarantopoulos: [C:03+1] "awesome! 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [15:13:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:13:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [15:15:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165551 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [15:18:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10964224 (10ops-monitoring-bot) Draining ganeti5007.eqsin.wmnet of running VMs [15:19:15] (03PS1) 10Andrew Bogott: Openstack cinder: create cinder user with proper shell and home dir [puppet] - 10https://gerrit.wikimedia.org/r/1165555 [15:19:15] (03CR) 10Bernard Wang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [15:19:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [15:20:03] (03CR) 10Andrew Bogott: [C:03+2] keystone policy: allow object_storage role to create/delete ec2 creds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [15:21:12] (03CR) 10Herron: [C:03+1] thanos: start sampled traces from query-frontend [puppet] - 
10https://gerrit.wikimedia.org/r/1165493 (https://phabricator.wikimedia.org/T394414) (owner: 10Filippo Giunchedi) [15:21:41] (03CR) 10Herron: [C:03+1] thanos: notify services on tracing changes [puppet] - 10https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: 10Filippo Giunchedi) [15:23:26] (03CR) 10Herron: [C:03+1] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:23:44] (03PS2) 10Andrew Bogott: Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - 10https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150) [15:23:45] (03PS2) 10Andrew Bogott: Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) [15:24:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10964266 (10BCornwall) [15:24:25] (03CR) 10Vgutierrez: [C:03+2] install_server: UEFI setup for cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: 10Fabfur) [15:24:33] (03CR) 10Herron: [C:03+1] "thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1165521 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [15:25:47] (03CR) 10Herron: [C:03+1] "LGTM lets try it" [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [15:26:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [15:26:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10964276 (10ops-monitoring-bot) Draining ganeti5007.eqsin.wmnet of running VMs [15:27:22] (03CR) 10Andrew Bogott: [C:03+2] Openstack cinder: create cinder user with proper shell and home dir [puppet] - 10https://gerrit.wikimedia.org/r/1165555 (owner: 10Andrew Bogott) [15:28:30] (03CR) 10Herron: [C:03+1] "LGTM! let's see how this goes for a bit and assuming all is well update the other "requests" SLOs we have today to match this convention" [puppet] - 10https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [15:28:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10964303 (10ayounsi) @Andrew Would it be possible to use a single 25G uplink (cf. {T325531}) to make it better with automation and overall design (all reas... [15:29:03] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10964305 (10elukey) Hi! I took a quick look and it seems that the function `force_http_boot_once` contacts Redfish and gets an unexpected response: ` PATCH https://10.193.3.242/redfish/... 
[15:29:49] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10964306 (10MoritzMuehlenhoff) [15:31:01] (03CR) 10Herron: [C:03+1] pyrra: add tonecheck Pyrra config [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [15:31:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to airflow-an and statboxes for htriedman - https://phabricator.wikimedia.org/T398075#10964316 (10Htriedman) 05In progress→03Resolved @Clement_Goubert just verified that I can get into stat10XX and an-airflo... [15:31:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1165493 (https://phabricator.wikimedia.org/T394414) (owner: 10Filippo Giunchedi) [15:32:16] (03Abandoned) 10Herron: grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) (owner: 10Herron) [15:32:30] (03Abandoned) 10Herron: etcd: add etcd-backup-v3 script [puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: 10Herron) [15:32:47] (03CR) 10Slyngshede: [C:03+1] "Looks correct, as compared to the other migrated lvs hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1165516 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:33:54] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs7003 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1165516 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:35:16] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs7003.magru.wmnet with reason: katran migration [15:37:50] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [15:37:54] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [15:44:10] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7003.magru.wmnet [15:44:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7003.magru.wmnet [15:45:42] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:45:53] (03PS1) 10Bking: wdqs: improve blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) [15:46:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) (owner: 10Bking) [15:47:21] (03PS1) 10Vgutierrez: hiera: Consolidate katran config for magru [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) [15:50:45] (03PS2) 10Bking: wdqs: improve blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) [15:50:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) (owner: 10Bking) [15:51:16] !log renamed 1 user for Unicode title-case transition - T396903 [15:51:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:19] T396903: Rename pages, images, and users to reflect migration to PHP 8.1 (Unicode 14) title-casing behavior - https://phabricator.wikimedia.org/T396903 [15:53:01] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:36] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:54:35] !log starting page renames for Unicode title-case transition - T396903 [15:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [16:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [16:00:05] jhathaway and moritzm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:18] (03CR) 10Btullis: [C:03+1] wdqs: improve blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) (owner: 10Bking) [16:01:38] !log finished page renames for Unicode title-case transition - T396903 [16:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:40] T396903: Rename pages, images, and users to reflect migration to PHP 8.1 (Unicode 14) title-casing behavior - https://phabricator.wikimedia.org/T396903 [16:02:08] if there are no objections, I will be running a quick time-critical mediawiki-config deployment shortly [16:02:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152295 (https://phabricator.wikimedia.org/T394556) (owner: 10Scott French) [16:03:05] (03CR) 10Cathal Mooney: New function to generate device-specific IBGP data from cluster YAML (034 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [16:03:39] (03Merged) 10jenkins-bot: Remove title-case overrides for PHP 8.1 migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152295 (https://phabricator.wikimedia.org/T394556) (owner: 10Scott French) [16:04:04] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1152295|Remove title-case overrides for PHP 8.1 migration (T394556)]] [16:04:07] T394556: Clean up UcfirstOverrides.php following PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T394556 [16:06:12] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1152295|Remove title-case overrides for PHP 8.1 migration (T394556)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:06:28] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2013 [16:06:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2013 [16:07:15] (03PS1) 10Jgiannelos: mobileapps: Disable profiler on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165565 [16:07:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:07:48] (03CR) 10Bking: [C:03+2] wdqs: improve blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1165562 (https://phabricator.wikimedia.org/T398341) (owner: 10Bking) [16:07:51] !log swfrench@deploy1003 swfrench: Continuing with sync [16:08:15] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [16:10:02] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:10:53] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:14] !log bking@prometheus1005:~$ sudo run-puppet-agent T398341 [16:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:19] T398341: Improve WDQS health monitor - https://phabricator.wikimedia.org/T398341 [16:13:25] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152295|Remove title-case overrides for PHP 8.1 migration (T394556)]] (duration: 09m 21s) [16:13:31] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:13:31] T394556: Clean up UcfirstOverrides.php following PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T394556 [16:17:46] RECOVERY - Host sretest2001 is 
UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [16:22:28] FIRING: [4x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:57] (03PS2) 10Jgiannelos: mobileapps: Disable profiler on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165565 [16:27:28] FIRING: [6x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:54] (03PS3) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 [16:31:04] (03CR) 10Cathal Mooney: [C:03+2] Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [16:31:32] (03CR) 10Hnowlan: [C:03+1] mobileapps: Disable profiler on staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165565 (owner: 10Jgiannelos) [16:31:34] (03Merged) 10jenkins-bot: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [16:32:15] (03CR) 10BCornwall: [C:03+1] wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) (owner: 10Gerrit maintenance bot) [16:33:07] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Disable profiler on staging. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165565 (owner: 10Jgiannelos) [16:33:38] (03PS1) 10BCornwall: Revert^2 "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1165570 [16:33:44] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: civi2001 - https://phabricator.wikimedia.org/T397380#10964583 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:34:40] (03CR) 10BCornwall: [C:03+1] wmnet: add discovery records for thumbor [dns] - 10https://gerrit.wikimedia.org/r/1164457 (https://phabricator.wikimedia.org/T397618) (owner: 10Hnowlan) [16:34:49] (03Merged) 10jenkins-bot: mobileapps: Disable profiler on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165565 (owner: 10Jgiannelos) [16:35:11] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1165571 [16:35:21] (03CR) 10Vgutierrez: [C:03+1] Revert^2 "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1165570 (owner: 10BCornwall) [16:36:06] (03CR) 10BCornwall: [C:03+2] Revert^2 "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1165570 (owner: 10BCornwall) [16:37:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:37:32] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:38:19] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd2003-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/1164284 (https://phabricator.wikimedia.org/T397968) (owner: 10Andrew Bogott) [16:40:02] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6121/co" [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (owner: 10Herron) [16:42:04] (03Abandoned) 10Ayounsi: border-in: add custom logging term for BGP 
sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1055974 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [16:42:20] (03PS2) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1165187 (owner: 10Ncmonitor) [16:44:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc3 T378715', diff saved to https://phabricator.wikimedia.org/P78734 and previous config saved to /var/cache/conftool/dbconfig/20250701-164405-ladsgroup.json [16:44:09] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [16:45:02] !log andrew@cumin1003 START - Cookbook sre.hosts.decommission for hosts cloudcephosd2003-dev.codfw.wmnet [16:45:15] (03CR) 10Majavah: Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [16:45:45] (03CR) 10Andrew Bogott: [C:03+2] Remove puppet refs to cloudcephosd2003 [puppet] - 10https://gerrit.wikimedia.org/r/1164285 (https://phabricator.wikimedia.org/T397979) (owner: 10Andrew Bogott) [16:45:46] (03PS2) 10Herron: pyrra-filesystem: clear output file on service stop [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) [16:45:55] (03PS4) 10Andrew Bogott: Remove puppet refs to cloudcephosd2003 [puppet] - 10https://gerrit.wikimedia.org/r/1164285 (https://phabricator.wikimedia.org/T397979) [16:45:55] (03CR) 10Cathal Mooney: [C:03+2] Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [16:49:07] (03Abandoned) 10Ayounsi: Varnish: prefix 403 and 429 with a unique ID [puppet] - 10https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973) (owner: 10Ayounsi) [16:50:24] (03CR) 10Andrew Bogott: [C:03+2] Remove puppet refs to cloudcephosd2003 [puppet] - 10https://gerrit.wikimedia.org/r/1164285 
(https://phabricator.wikimedia.org/T397979) (owner: 10Andrew Bogott) [16:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [16:51:41] !log andrew@cumin1003 START - Cookbook sre.dns.netbox [16:52:14] (03PS1) 10Cathal Mooney: WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 [16:54:19] (03Abandoned) 10Cathal Mooney: Update border-in firewall filter to set DSCP bits to DE [homer/public] - 10https://gerrit.wikimedia.org/r/931262 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [16:55:57] !log andrew@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2003-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [16:56:41] !log andrew@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd2003-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1003" [16:56:42] !log andrew@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:56:42] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd2003-dev.codfw.wmnet [16:57:43] (03PS2) 10Cathal Mooney: WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 [16:58:55] (03CR) 10Majavah: [C:03+1] WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 (owner: 10Cathal 
Mooney) [16:59:02] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcephosd2003-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397979#10964717 (10Andrew) a:05Andrew→03None [16:59:40] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10964721 (10Andrew) 05Open→03Resolved [17:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [17:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1700) [17:00:10] (03CR) 10Ayounsi: [C:03+1] WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 (owner: 10Cathal Mooney) [17:02:17] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 (owner: 10Cathal Mooney) [17:02:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [17:02:48] (03Merged) 10jenkins-bot: WMCS: Update ACL / filter name for interface towards Cloud VRF [homer/public] - 10https://gerrit.wikimedia.org/r/1165575 (owner: 10Cathal Mooney) [17:07:28] RESOLVED: [4x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:28] (03PS1) 10Cwhite: logstash: deploy phatality 2.7.0.3 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165578 
(https://phabricator.wikimedia.org/T398305) [17:10:30] (03PS1) 10Cwhite: logstash: deploy phatality 2.7.0.3 to production [puppet] - 10https://gerrit.wikimedia.org/r/1165579 (https://phabricator.wikimedia.org/T398305) [17:11:17] (03CR) 10Cwhite: [C:03+2] logstash: deploy phatality 2.7.0.3 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1165578 (https://phabricator.wikimedia.org/T398305) (owner: 10Cwhite) [17:27:33] (03PS2) 10Bernard Wang: Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 [17:28:22] (03CR) 10CI reject: [V:04-1] Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [17:34:41] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:38:13] (03PS1) 10CDanis: haproxy: use_benthos=>true [puppet] - 10https://gerrit.wikimedia.org/r/1165582 (https://phabricator.wikimedia.org/T329332) [17:39:22] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10964891 (10Jhancock.wm) I believe so. It has the 1 CPU for lower power consumption and twelve 16GB disks. If i remember correctly, that was the intention for these. 
[17:39:41] FIRING: [48x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:40:10] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.8; 2025-07-01): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10964895 (10DLynch) Needs a follow-up because the patch isn't logging correctly. [17:59:06] (03PS3) 10Andrea Denisse: centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 [17:59:06] (03CR) 10Andrea Denisse: "Hi folks, I tested this in Pontoon. I'd greatly appreciate your feedback on this approach." [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (owner: 10Andrea Denisse) [18:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [18:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [18:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1800). 
[18:01:23] (03PS4) 10Andrea Denisse: centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) [18:08:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:09:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10964951 (10Andrew) >>! In T394333#10964303, @ayounsi wrote: > @Andrew Would it be possible to use a single 25G uplink (cf. {T325531}) to make it better wi... [18:13:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:19:05] (03PS1) 10Bernard Wang: Update mobile search overlay temporary input styles [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165585 [18:30:45] (03CR) 10Cwhite: [C:03+1] "I see only minor stylistic problems with the patch as is." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [18:33:48] (03CR) 10Cwhite: [C:03+1] "Sent before I finished my thought 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [18:35:20] (03CR) 10Cwhite: [C:03+2] logstash: add tests to normalize_labels [puppet] - 10https://gerrit.wikimedia.org/r/1160239 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [18:45:49] (03PS1) 10Cwhite: logstash: test filter_on_template_v2 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1165586 (https://phabricator.wikimedia.org/T234565) [18:48:08] (03CR) 10Cwhite: "PCC ok: https://puppet-compiler.wmflabs.org/output/1165586/6124/" [puppet] - 10https://gerrit.wikimedia.org/r/1165586 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:49:15] (03PS3) 10Bernard Wang: Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 [18:50:12] (03CR) 10CI reject: [V:04-1] Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [18:55:46] jhathaway@cumin2002 interactive (PID 3780986) is awaiting input [18:56:53] (03PS1) 10Andrew Bogott: cloudceph: move per-host puppet7 def to role [puppet] - 10https://gerrit.wikimedia.org/r/1165587 [18:59:52] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10965044 (10phaultfinder) [19:00:34] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173 [19:00:37] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [19:00:48] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host 
cloudcephmon2004-dev.codfw.wmnet with OS bookworm [19:01:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott) [19:02:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163483 (https://phabricator.wikimedia.org/T397788) (owner: 10ZhaoFJx) [19:02:09] (03PS1) 10DLynch: Edit check: fix counter logging for SLO [extensions/VisualEditor] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165589 (https://phabricator.wikimedia.org/T395444) [19:02:50] (03PS5) 10Andrea Denisse: centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) [19:05:54] Anyone who'd be inconvenienced if I did that backport quickly? [19:06:07] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965081 (10VRiley-WMF) 05Open→03In progress @Eevans Taking a look at this now [19:07:29] (03CR) 10LorenMora: [C:03+1] Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [19:08:01] (03CR) 10LorenMora: [C:03+1] Update mobile search overlay temporary input styles [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165585 (owner: 10Bernard Wang) [19:08:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:12:31] (03CR) 10Andrea Denisse: centrallog: Log with standard and custom template (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [19:13:47] Nobody chimed up, so I am going to do said backport. [19:14:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165589 (https://phabricator.wikimedia.org/T395444) (owner: 10DLynch) [19:15:58] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965116 (10VRiley-WMF) Pulling sessionstore1005 down now to reseat cables. [19:16:34] (03Merged) 10jenkins-bot: Edit check: fix counter logging for SLO [extensions/VisualEditor] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165589 (https://phabricator.wikimedia.org/T395444) (owner: 10DLynch) [19:16:59] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]] [19:17:02] T395444: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444 [19:19:04] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:20:07] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [19:20:17] !log kemayo@deploy1003 kemayo: Continuing with sync [19:23:19] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [19:24:22] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567]-dev service implementation - https://phabricator.wikimedia.org/T397237#10965132 (10Andrew) 05Open→03Resolved [19:25:21] (03CR) 10Dzahn: [C:03+2] Phabricator: Update recipients of quarterly metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/1165464 (owner: 10Aklapper) [19:25:42] (03CR) 10Dzahn: [C:03+2] "tech debt" [puppet] - 10https://gerrit.wikimedia.org/r/1165464 (owner: 10Aklapper) [19:26:06] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165589|Edit check: fix counter logging for SLO (T395444)]] (duration: 09m 07s) [19:26:09] T395444: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444 [19:27:38] (03CR) 10Dzahn: [C:03+2] Phabricator monthly email: Rename var (now reserved word in MariaDB) [puppet] - 10https://gerrit.wikimedia.org/r/1165461 (https://phabricator.wikimedia.org/T398267) (owner: 10Aklapper) [19:32:39] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965146 (10VRiley-WMF) Powering up sessionstore1005 now, it looks like there was a BIOS update that was pushed to it at some point. It's installing the BIOS update now. [19:32:49] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 10MW-1.45-notes (1.45.0-wmf.9; 2025-07-08): Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10965147 (10DLynch) Okay, all fixed. Issue was that we'd merged `ve.track( 'stats.med... 
[19:33:54] (03CR) 10Cwhite: [C:03+2] logstash: add tests to dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/1160240 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [19:34:50] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10965166 (10phaultfinder) [19:35:41] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965181 (10VRiley-WMF) @Eevans It looks like the unit is back up. Can you please confirm? [19:37:22] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:42:24] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:43:33] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [19:47:25] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [19:47:39] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10965202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e... 
[19:49:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164637 (https://phabricator.wikimedia.org/T398107) (owner: 10EggRoll97) [19:49:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [19:50:21] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:50:23] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:53:32] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:36] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250630T1430) [20:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T2000) [20:00:05] bwang, ZhaoFJx, and EggRoll97: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965221 (10Eevans) >>! In T398225#10965181, @VRiley-WMF wrote: > @Eevans It looks like the unit is back up. Can you please confirm? I don't know what to say about the BIOS update —that's weird— that wasn... [20:00:13] o/ [20:00:17] o/ [20:02:27] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965232 (10VRiley-WMF) So, yes. I did reseat the cables and that seemed to have helped it get to this point. [20:02:59] (03PS4) 10Bernard Wang: Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 [20:03:30] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173 [20:03:32] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [20:04:12] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1005.eqiad.wmnet with OS bullseye [20:04:25] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10965234 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad... 
[20:04:36] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1005.eqiad.wmnet with OS bullseye [20:04:48] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10965235 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1005.e... [20:04:49] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965236 (10VRiley-WMF) 05In progress→03Open [20:12:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:13:28] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8998 bytes in 4.666 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:14:04] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.655 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:52] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965257 (10Jclark-ctr) a:03VRiley-WMF [20:16:59] (03PS1) 10JHathaway: WIP: do not merge [cookbooks] - 10https://gerrit.wikimedia.org/r/1165598 [20:19:24] Sorry, did I miss anything? 
[20:19:58] disconnected accidentally [20:20:07] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:20:10] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:20:58] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1005.eqiad.wmnet with reason: host reimage [20:24:50] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1005.eqiad.wmnet with reason: host reimage [20:26:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:26:26] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:26:39] is a deployer needed? sorry my mtg went over [20:27:41] bwang? [20:27:43] ZhaoFJx? [20:27:50] EggRoll97? [20:27:55] 10ops-eqiad, 06SRE, 06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965294 (10VRiley-WMF) @Eevans is it okay to close this now? 
[20:28:09] cjming would be great [20:28:17] if you could deploy [20:28:23] alrighty [20:28:27] thanks [20:28:34] ZhaoFJx: i'll do yours now [20:28:43] (03PS2) 10ZhaoFJx: zhwiki: Permissions change for abusefilter groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163483 (https://phabricator.wikimedia.org/T397788) [20:29:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163483 (https://phabricator.wikimedia.org/T397788) (owner: 10ZhaoFJx) [20:30:49] (03Merged) 10jenkins-bot: zhwiki: Permissions change for abusefilter groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163483 (https://phabricator.wikimedia.org/T397788) (owner: 10ZhaoFJx) [20:31:13] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1163483|zhwiki: Permissions change for abusefilter groups (T397788)]] [20:31:16] T397788: zhwiki: Grant abusefilter-access-protected-vars to abusefilter and abusefilter-helper - https://phabricator.wikimedia.org/T397788 [20:33:20] !log cjming@deploy1003 zhaofjx, cjming: Backport for [[gerrit:1163483|zhwiki: Permissions change for abusefilter groups (T397788)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:33:27] checking [20:33:39] thanks [20:34:29] (03CR) 10Eevans: [C:03+2] sessionstore1005: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165016 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:35:50] cjming works well [20:36:04] nice - syncing [20:36:14] !log cjming@deploy1003 zhaofjx, cjming: Continuing with sync [20:37:58] (03PS1) 10Dzahn: cloudvps/devtools: fix FQDN for Phabricator test instance [puppet] - 10https://gerrit.wikimedia.org/r/1165600 (https://phabricator.wikimedia.org/T397626) [20:39:41] RESOLVED: [48x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:41:48] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163483|zhwiki: Permissions change for abusefilter groups (T397788)]] (duration: 10m 35s) [20:41:51] T397788: zhwiki: Grant abusefilter-access-protected-vars to abusefilter and abusefilter-helper - https://phabricator.wikimedia.org/T397788 [20:42:10] ZhaoFJx: should be live! [20:42:42] checked again without testserver - working! [20:42:50] cjming thank you for deploy :) [20:42:54] yw! 
[20:43:33] if anyone else needs something deployed in the next 15, please feel free to ping me - i'll hang out for a bit [20:45:06] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [20:45:08] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:45:15] (03CR) 10Dzahn: [C:03+2] cloudvps/devtools: fix FQDN for Phabricator test instance [puppet] - 10https://gerrit.wikimedia.org/r/1165600 (https://phabricator.wikimedia.org/T397626) (owner: 10Dzahn) [20:46:07] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1005.eqiad.wmnet with OS bullseye [20:46:22] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10965371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1005.eqiad... 
[20:48:11] jhathaway@cumin2002 provision (PID 3803422) is awaiting input [20:48:26] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [20:53:32] FIRING: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:34] FIRING: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T2100) [21:00:57] Web will be using the window today [21:03:45] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [21:03:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:06:41] !log andrew@cumin1003 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [21:08:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:20:53] (03PS2) 10Samtar: errorpage.html.erb: Use flex for page layout [puppet] - 10https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) [21:21:21] (03CR) 10Ryan Kemper: [C:03+2] Update plugins for extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599) (owner: 10Ebernhardson) [21:22:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:23:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:01] (03PS1) 10Dr0ptp4kt: WIP DNM: Access to airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1165605 [21:25:33] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm [21:25:42] (03CR) 10CI reject: [V:04-1] WIP DNM: Access to airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (owner: 10Dr0ptp4kt) [21:25:50] (03PS2) 10Dr0ptp4kt: WIP DNM: Access to airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [21:25:59] 10ops-eqiad, 06SRE, 
06DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965453 (10Eevans) 05Open→03Resolved >>! In T398225#10965294, @VRiley-WMF wrote: > @Eevans is it okay to close this now? Yes; Let's. Thanks for your help! [21:26:31] (03CR) 10CI reject: [V:04-1] WIP DNM: Access to airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [21:28:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10965457 (10VRiley-WMF) Is there a time when we can plan for me to look and try to swap at least one of those drives? I'll need to power down the unit to see where those drives may be located at. and then try t... [21:38:32] RESOLVED: ProbeDown: Service sessionstore1005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1005-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:44] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:41] (03PS1) 10Scott French: aptrepo: add pcre2-php83-bullseye to Update list [puppet] - 10https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) [21:40:41] (03CR) 10Scott French: "Alas, I missed this on the first go. Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [21:42:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:45:20] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:46:51] Getting started! 
[21:48:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [21:48:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:48:34] We're doing a config deploy followed by a backport [21:48:42] Hopefully things are nice and smooth and speedy [21:49:01] I was listening to Monaco by Bad Bunny earlier [21:49:04] still hits imo [21:49:11] (03Merged) 10jenkins-bot: Enable mobile search recommendations in all eligible wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165549 (owner: 10Bernard Wang) [21:49:34] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1165549|Enable mobile search recommendations in all eligible wikis except enwiki]] [21:50:23] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:51:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:51:39] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:51:41] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:51:41] !log toyofuku@deploy1003 toyofuku, bwang: Backport for [[gerrit:1165549|Enable mobile search recommendations in all eligible wikis except enwiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:54:13] !log toyofuku@deploy1003 toyofuku, bwang: Continuing with sync [21:55:18] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [21:55:18] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti1053.eqiad.wmnet with OS bookworm [21:56:13] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [21:56:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10965564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [21:58:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:59:04] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [21:59:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10965568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [21:59:44] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165549|Enable mobile search recommendations in all eligible wikis except enwiki]] (duration: 10m 10s) [22:00:00] One more to go [22:01:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165585 (owner: 10Bernard Wang) [22:01:26] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 10 days, 0:00:00 on miscweb1003.eqiad.wmnet with reason: decom [22:02:10] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on miscweb2003.codfw.wmnet with reason: decom [22:03:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "I double checked one last time there were no more hits in any apache logs. I had been waiting until ITS confirmed the RT dumps are in back" [puppet] - 10https://gerrit.wikimedia.org/r/1159564 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [22:04:20] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:05:19] (03Merged) 10jenkins-bot: Update mobile search overlay temporary input styles [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1165585 (owner: 10Bernard Wang) [22:05:46] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1165585|Update mobile search overlay temporary input styles]] [22:07:52] !log toyofuku@deploy1003 bwang, toyofuku: Backport for [[gerrit:1165585|Update mobile search overlay temporary input styles]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:08:03] Coordinating testing [22:08:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:09:22] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:10:01] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:10:03] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:14:04] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:15:30] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965587 (10Jclark-ctr) [22:17:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965588 (10Jclark-ctr) @MatthewVernon I have racked the two remaining servers. D7 was helpful. C4 is a 1G rack, so that didn’t alleviate much, but I was... 
[22:17:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns ms-be1092,934 - jclark@cumin1002" [22:17:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns ms-be1092,934 - jclark@cumin1002" [22:17:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:18:15] !log dzahn@cumin1002 START - Cookbook sre.hosts.decommission for hosts miscweb2003.codfw.wmnet [22:19:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1093.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:19:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1092.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:22:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:23:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:45] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [22:26:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [22:28:04] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [22:28:22] !log dzahn@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [22:28:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:28:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts miscweb2003.codfw.wmnet [22:28:58] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:19] !log toyofuku@deploy1003 bwang, toyofuku: Continuing with sync [22:31:12] !log dzahn@cumin1002 START - Cookbook sre.hosts.decommission for hosts miscweb1003.eqiad.wmnet [22:34:19] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [22:34:49] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [22:35:00] jhancock@cumin1003 provision (PID 4073878) is awaiting input [22:35:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1092.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:43] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165585|Update mobile search overlay temporary input styles]] (duration: 29m 56s) [22:36:12] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [22:37:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10965603 (10VRiley-WMF) [22:38:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10965604 (10VRiley-WMF) 05Open→03Resolved This is now completed. 
Thanks to @Jhancock.wm and @Jclark-ctr for helping with this. [22:39:30] jclark@cumin1002 provision (PID 3744008) is awaiting input [22:39:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10965611 (10Jclark-ctr) @VRiley-WMF I only created OS VD and imaged servers you did finish creating other virtual drives correct? [22:41:39] dzahn@cumin1002 decommission (PID 3746912) is awaiting input [22:41:42] jouncebot: nowandnext [22:41:42] For the next 21 hour(s) and 48 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430) [22:41:42] In 7 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0600) [22:43:13] (03CR) 10Zabe: [C:03+2] categorylinks: Set group0 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165540 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [22:44:03] (03Merged) 10jenkins-bot: categorylinks: Set group0 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165540 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [22:44:13] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [22:44:34] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [22:44:36] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165540|categorylinks: Set group0 to read new (T397912)]] [22:44:39] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [22:45:06] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:45:06] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts miscweb1003.eqiad.wmnet [22:46:46] !log zabe@deploy1003 zabe: 
Backport for [[gerrit:1165540|categorylinks: Set group0 to read new (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:47:45] !log zabe@deploy1003 zabe: Continuing with sync [22:47:49] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [22:48:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1093.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:48:47] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [22:49:07] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [22:52:30] jclark@cumin1002 reimage (PID 3748369) is awaiting input [22:53:17] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165540|categorylinks: Set group0 to read new (T397912)]] (duration: 08m 40s) [22:53:20] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [22:54:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:54:15] (03PS1) 10Zabe: Revert "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165615 (https://phabricator.wikimedia.org/T397912) [22:54:24] (03CR) 10Zabe: [V:03+2 C:03+2] Revert "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165615 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [22:54:29] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sync - dzahn@cumin1002" [22:54:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: sync - dzahn@cumin1002"
[22:54:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:54:55] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165615|Revert "categorylinks: Set group0 to read new" (T397912 T398380)]]
[22:54:59] T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380
[22:55:55] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10965687 (VRiley-WMF) @Jclark-ctr That's correct, yes.
[22:56:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1092.eqiad.wmnet with OS bullseye
[22:56:16] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1092.eqiad.wmnet with OS bullseye
[22:56:34] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10965689 (Jhancock.wm) got a pass. but... ` Testing Redfish API connection to cp2043 (10.193.1.85) [IDRAC.2.11.SYS057] Exporting Server Configuration Profile. [1/30,...
[22:57:10] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165615|Revert "categorylinks: Set group0 to read new" (T397912 T398380)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:57:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[22:57:54] !log zabe@deploy1003 zabe: Continuing with sync
[22:58:08] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10965692 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cp2043.codfw.wmnet with OS bullseye
[22:58:11] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bullseye
[22:58:21] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10965693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cp2043.codfw.wmnet with OS bullseye executed with errors: - c...
[23:01:38] (PS1) Dzahn: remove legacy miscweb VM service names [dns] - https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080)
[23:03:45] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165615|Revert "categorylinks: Set group0 to read new" (T397912 T398380)]] (duration: 08m 49s)
[23:03:48] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[23:03:49] T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380
[23:04:33] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10965714 (Jhancock.wm) that didn't take long. same issue i saw on sretest2006 ` Exception raised while executing cookbook sre.hosts.reimage: Traceback (most recent c...
[23:08:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1093.eqiad.wmnet with OS bullseye
[23:08:42] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965716 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1093.eqiad.wmnet with OS bullseye
[23:12:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-esams and KPN (2001:67c:24f0:cfe0::1:2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=KPN - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:15:35] (PS1) Dzahn: peopleweb: temporarily add miscweb monitoring profile [puppet] - https://gerrit.wikimedia.org/r/1165617 (https://phabricator.wikimedia.org/T397080)
[23:16:27] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm
[23:16:33] (CR) Dzahn: [C:+2] peopleweb: temporarily add miscweb monitoring profile [puppet] - https://gerrit.wikimedia.org/r/1165617 (https://phabricator.wikimedia.org/T397080) (owner: Dzahn)
[23:16:39] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10965726 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed...
[23:17:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:19:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm
[23:19:30] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10965744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed...
[23:22:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage
[23:26:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage
[23:27:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage
[23:31:05] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965756 (Jclark-ctr)
[23:31:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage
[23:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1165622
[23:38:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1165622 (owner: TrainBranchBot)
[23:39:57] jhathaway@cumin2002 reimage (PID 3827832) is awaiting input
[23:46:28] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84351MiB (3% inode=99%):
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[23:52:56] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1165622 (owner: TrainBranchBot)
[23:53:21] (CR) Dzahn: [C:+2] "confirmed the checks are still visible under https://thanos.wikimedia.org/targets" [puppet] - https://gerrit.wikimedia.org/r/1165617 (https://phabricator.wikimedia.org/T397080) (owner: Dzahn)